Archive for March, 2008

Constraints In PostgreSQL

Friday, March 28th, 2008

After a very long time, I decided it was time to rewrite a script I created so I could move databases from one version of PostgreSQL to another. For instance, I could move a PostgreSQL 8.3 database to any version on PostgreSQL after 7.3. The problem was, I never got around to implementing constraints in it. Hopefully someone can find this info useful.

Constraint information is held in the pg_constraint table. The column, conrelid, defines the relation id of the table the constraint is on, and it corresponds to the relfilenode column in pg_class. This column is used for the join.

Constraint type is held in the contype column of the pg_constraint table. Here’s a map of the different types:

  • f = Foreign Key
  • p = Primary Key
  • u = Unique
  • c = Check Constraint

The consrc column holds information about check constraints, so for instance, we can do:

SELECT
    relname,
    conname,
    consrc
    FROM pg_constraint, pg_class
    WHERE conrelid = relfilenode AND contype = 'c';

relname     |  conname                     | consrc
---------------------------------------------------------------------
ratings      | ratings_chk_rating    | ((rating >= 1) AND (rating <= 5))

(1 row)

Primary Keys and Unique constraints are a bit more complex, as you have to find the column name as well. You can do something like this (Note: this won’t work for multi-column primary/unique keys):

SELECT
    relname,
    attname,
    conname,
    CASE WHEN contype = 'p' THEN 'primary' ELSE 'unique' END AS type
    FROM pg_constraint
        INNER JOIN pg_class ON (conrelid = relfilenode)
        INNER JOIN pg_attribute ON (conrelid = attrelid AND conkey[1] = attnum)
    WHERE contype = 'u' OR contype = 'p';

Will return something like:

relname | attname | conname       | type
--------------------------------------------
stores    | id             | stores_pkey | primary

Foreign Keys are even more complex, as you have to join pg_attribute and pg_class twice:

SELECT
    c1.relname AS local_table,
    c2.relname AS foreign_table,
    conname,
    a1.attname AS local_key,
    a2.attname AS foreign_key,
    condeferrable
    FROM pg_constraint
        INNER JOIN pg_class c1 ON (conrelid = c1.relfilenode)
        INNER JOIN pg_class c2 ON (confrelid = c2.relfilenode)
        INNER JOIN pg_attribute a1 ON (conrelid = a1.attrelid AND conkey[1] = a1.attnum)
        INNER JOIN pg_attribute a2 ON (confrelid = a2.attrelid AND confkey[1] = a2.attnum)
    WHERE contype = 'f';

Programming Language Rundown

Friday, March 21st, 2008

Ruby

A lot of hype, useless in the real world.

Python

A bunch of self-punishing nuns who whip themselves with chain every time they don’t follow strict syntax standards.

Perl

Used to write scripts that automate server functions, by people who don’t realize you can do the exact same things with a bash script in 1/5th the time and with half the bugs. Also: impossible to read.

ColdFusion

The bad guy in the movie that never seems to die.

PHP

Written by the largest group of schizophrenics with obsessive-compulsive disorder. They try to keep backwards compatibility where it’s not needed, and remove it where it is. Also: has had more API changes than Michael Jackson has had nose jobs.

LOLCODE

The best language ever written. There’s even PL/LOLCODE for PostgreSQL.

HAI

CAN HAS STDIO?

VISIBLE “HAI WORLD!”

KTHXBYE

Will Maria Live Up To The Hype?

Friday, March 21st, 2008

Monty recently announced the Maria engine for MySQL, promising MVCC, better memory management, crash-safe design, and ACID compliance among other things. But will it live up to the hype? I’ve been disappointed in the past, mainly with InnoDB, so I’m understandably a bit skeptical that this will be a major leap for MySQL. I would much rather see better datatype integration, and things like dynamically generated default values fixed (try setting a default value to CURRENT_DATE(), you’ll see what I mean).

My biggest beef with MySQL is they never seem to fix things that do not work as they should. For example, if you do BEGIN; CREATE TABLE; ROLLBACK;, the table will still exist after the ROLLBACK command, and also, constraints are often completely ignored.

Improving Dynamic Images [Theoretically]

Thursday, March 20th, 2008

Something I’ve been thinking about lately, is how to improve the performance of images served from a database. The model I’ve deployed in the past, looks like this:

  • Apache Rewrites all requests to /Images to /image-script.php?image=$1
  • The script then includes all base files, some not necessary, and performs a sanity check.
  • The script grabs the meta-info from the database, for the requested file.
  • An ETag, as well as other HTTP headers, such as Cache-Control, Content-Disposition, etc, are generated.
  • The script checks for a cached copy of the data (using Cache_Lite), and if it’s not found or it’s expired, it fetches the image data from the database, and caches it.
  • Either the headers and data is sent, or a 304 Not Modified is sent.

The flaw with this design: things are included, but not used, and the database is touched for every request. The new design that I’m thinking about would be a very minimalistic class, and it would only access the database if the cache doesn’t exist, or if it’s expired. It would look more like this:

  • Class is initialize, and sanitizes all input.
  • When called upon, the class checks the cache for the serialized database result, and unserialize()’s it if it’s found.
  • If not found, it opens the database, fetches the data, then caches the serialize()’d result.
  • Either a 304 Not Modified is then sent, or the headers are generated and sent along with the data.
  • The backend administration panel clears the cache [per virtual file] when UPDATEs or DELETEs occur. This is to avoid having an updated image not hit live until the cache is expired (the previous model used the meta info from the database to find out if this happened or not).

My initial tests show that I can get around 200-500 requests a second on a real world server by using this model. Not too bad in my opinion.

Bashing VACUUM? Now What’s Wrong With This Picture?

Thursday, March 20th, 2008

One thing I get sick of is MySQL users bashing PostgreSQL for having the VACUUM command. They basically say the usual: VACUUM is slow, it’s flawed design, why should I have to VACUUM, etc. This, despite the fact that MySQL has a command that does almost the exact same thing, the difference of course being that PostgreSQL also collects stats from the database to optimize how the planner functions.

VACUUM, of course, is slow and uses a lot of CPU/Disk, but seriously, what’s stopping you from setting up a cron script that does `echo “VACUUM ANALYZE” | nice -n 19 psql database_name`, every night at midnight? I have a ~300MB database, and it takes all of three seconds to VACUUM ANALYZE. The only problem would be if you have a humongous database, or a slow server.

Tweaking PostgreSQL Performance

Saturday, March 15th, 2008

seq_page_cost:

This is set conservatively by default. Setting it to higher values will cause the planner to favor index scans over sequential scans, but if you set it too high, it will degrade performance for some queries. For most servers, a good value is between 2.0 and 8.0 (note, this assumes you leave random_page_cost alone), but if you have a huge query somewhere, it may be a good idea to go higher. For really big queries, with ~4+ JOINs and multiple WHERE clauses, it’s not uncommon to see performance increase several times by upping this value. Also, if you have a lot of ram (ie. your database fits in ram), setting random_page_cost and seq_page_cost to the same value can increase performance.

effective_cache_size:

This variable controls how aggressively the planner utilizes disk. High values will cause the planner to use disk much more aggressively, under the assumption that your operating system is caching it, while lower values cause it to behave much more conservatively for disk-expensive tasks. For a dedicated database server, a good policy is to give it most of your memory. For 8GB, give it 6GB, for 4GB, give it 2-3GB, etc. Linux is pretty aggressive at using disk cache, so it’s not uncommon for it to use 80+% of your memory for cache. Note that this mainly impacts large datasets, so most queries are unaffected by this. Also, setting this to high values usually doesn’t hurt performance.

shared_buffers:

This controls how much memory is shared between PostgreSQL processes. Unlike what you might think, setting this to a high value degrades performance. Read the documentation on how this works, increasing variables like max_connections will require you to increase shared_buffers. Anywhere between 64MB and 256MB should be plenty. Also, keeping this as low as possible may improve checkpoint performance, as higher values will cause PostgreSQL to keep more data in memory before writing to disk, so it creates more overhead (note, this is no longer the case in 8.3). Also note that you will need to increase the max shared memory allowed by you OS to increase shared_buffers. This is done by adding kernel.shmmax = <number of kilobytes> into /etc/sysctl.conf, then running `sudo /sbin/sysctl -p /etc/sysctl.conf`.

Icculus.org Quake2 On 64bit Linux

Monday, March 10th, 2008

Here’s some patches to get Quake2 running on 64bit Linux. It should apply cleanly to quake2-r0.16.1.

Quake2 64bit Fix

Rogue Expansion Fix

Xatrix Expansion Fix

Managing Row Counts Using Triggers

Sunday, March 9th, 2008

It seems like this is a subject that I keep coming back to. Every situation is unique, and demands it’s own solution; so far, there’s no one-size-fits all solution for working around COUNT(*) in PostgreSQL. It some situations, there is no need for a workaround, and the performance hit of COUNT(*) is neglectable (this is generally true if you have under 2000 rows, but it mainly depends on the server).

This time, the problem I ran into is how to maintain the row count for how many friends a person has. Let’s say we have 2000 users, and each user has 30 friends. That would be 60,000 rows, and out of the range of what COUNT(*) can handle. The solution here is to create a count index table, that tracks how many friends each user has. The friends table looks something like this:

CREATE TABLE friends (
    your_id BIGINT NOT NULL,
    friend_id BIGINT NOT NULL,

    CONSTRAINT friends_chk_your_id FOREIGN KEY
        (your_id) REFERENCES members (id) DEFERRABLE,
    CONSTRAINT friends_chk_friend_id FOREIGN KEY
        (friend_id) REFERENCES members (id) DEFERRABLE
);

CREATE INDEX friends_your_id_idx ON friends (your_id);
CREATE INDEX friends_friend_id_idx ON friends (friend_id);
CREATE UNIQUE INDEX friends_unq_id_idx ON friends (your_id, friend_id);

Nothing special, just your member id, and your friends member id, with a unique index to make sure there’s no duplicate entries, and some foreign keys to make sure the user id’s are valid. I want to keep track of how many friends a user has, so that would be a simple two column table:

CREATE TABLE friends_count_idx (
    num_rows INTEGER NOT NULL,
    your_id BIGINT NOT NULL,

    PRIMARY KEY(your_id),
    CONSTRAINT friends_count_idx_chkcount CHECK (num_rows >= 0)
);

CREATE INDEX friends_count_idx_num_rows ON friends_count_idx (num_rows);

Pretty simple, right? A primary key for your_id, and a check constraint to make sure you don’t go under 0 rows. Now here’s where the fun begins.

I want a function to initialize the friends_count_idx table, as well as clean it up if something ever happens and it starts to drift. So basically, we’re looking at a maintenance function that deletes any existing records, as well as inserts records based on what’s in friends:

CREATE OR REPLACE FUNCTION friends_count_idx_materialize ()
RETURNS VARCHAR AS $$
    DECLARE
        rows INTEGER;
    BEGIN
        -- Delete old records
        DELETE FROM friends_count_idx;

        INSERT INTO friends_count_idx (num_rows, your_id)
        SELECT COUNT(*), your_id FROM friends
            GROUP BY your_id;

        GET DIAGNOSTICS rows = ROW_COUNT;

        RETURN rows || ' Rows Processed';
    END;
$$ LANGUAGE plpgsql;

SELECT friends_count_idx_materialize();
--------------------
53 Rows Processed
(1 row)

SELECT * FROM friends_count_idx LIMIT 1;
num_rows | your_id
----------+---------
23 |      14
(1 row)

Now we’re getting somewhere… But I also need a set of triggers that handle inserts and deletes. They also need to be able to auto-create a record in friends_count_idx for new users, who are adding their first friend. Still, nothing too hard, but time consuming to write and debug:

CREATE OR REPLACE FUNCTION friends_count_idx_insert ()
RETURNS TRIGGER AS $$
    DECLARE
        r RECORD;
    BEGIN
        SELECT INTO r num_rows FROM friends_count_idx
            WHERE your_id = NEW.your_id;

        -- Auto Create Record
        IF NOT FOUND THEN
            INSERT INTO friends_count_idx (num_rows, your_id)
            SELECT COUNT(*), your_id FROM friends
            WHERE your_id = NEW.your_id GROUP BY your_id;

            RETURN NEW;
        END IF;

        -- Update Existing Record
        UPDATE friends_count_idx SET num_rows = (r.num_rows + 1)
            WHERE your_id = NEW.your_id;

        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION friends_count_idx_delete ()
RETURNS TRIGGER AS $$
    DECLARE
        r RECORD;
        rows INTEGER;
    BEGIN
        SELECT INTO r num_rows FROM friends_count_idx
            WHERE your_id = OLD.your_id;

        -- Auto Create Record
        IF NOT FOUND THEN
            INSERT INTO friends_count_idx (num_rows, your_id)
            SELECT COUNT(*), your_id FROM friends
            WHERE your_id = OLD.your_id GROUP BY your_id;

            RETURN OLD;
        END IF;

        rows = r.num_rows - 1;

        -- Check for row count drift
        IF rows <= 0 THEN
            RETURN OLD;
        END IF;

        -- Update Existing Record
        UPDATE friends_count_idx SET num_rows = (r.num_rows - 1)
        WHERE your_id = OLD.your_id;

        RETURN OLD;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER friends_count_idx_insert AFTER INSERT ON friends
FOR EACH ROW EXECUTE PROCEDURE friends_count_idx_insert();

CREATE TRIGGER friends_count_idx_delete AFTER DELETE ON friends
FOR EACH ROW EXECUTE PROCEDURE friends_count_idx_delete();

SELECT * FROM friends_count_idx WHERE your_id = 3;

num_rows | your_id
----------+---------
3 |       3
(1 row)

INSERT INTO friends (your_id, friend_id)  VALUES (3, 7);

SELECT * FROM friends_count_idx WHERE your_id = 3;

num_rows | your_id
----------+---------
4 |       3
(1 row)
DELETE FROM friends WHERE your_id = 3 AND friend_id = 7;

SELECT * FROM friends_count_idx WHERE your_id = 3;

num_rows | your_id
----------+---------
3 |       3
(1 row)

So, it’s running smoothly. It could still probably use some more debugging, just to make sure every possible situation is covered, but the skeleton is working fine.