Archive for the ‘PostgreSQL’ Category

Tweaking PostgreSQL Performance

Saturday, March 15th, 2008

seq_page_cost:

This is set conservatively by default. Setting it to higher values will cause the planner to favor index scans over sequential scans, but if you set it too high, it will degrade performance for some queries. For most servers, a good value is between 2.0 and 8.0 (note, this assumes you leave random_page_cost alone), but if you have a huge query somewhere, it may be a good idea to go higher. For really big queries, with ~4+ JOINs and multiple WHERE clauses, it’s not uncommon to see performance increase several times by upping this value. Also, if you have a lot of ram (ie. your database fits in ram), setting random_page_cost and seq_page_cost to the same value can increase performance.

effective_cache_size:

This variable controls how aggressively the planner utilizes disk. High values will cause the planner to use disk much more aggressively, under the assumption that your operating system is caching it, while lower values cause it to behave much more conservatively for disk-expensive tasks. For a dedicated database server, a good policy is to give it most of your memory. For 8GB, give it 6GB, for 4GB, give it 2-3GB, etc. Linux is pretty aggressive at using disk cache, so it’s not uncommon for it to use 80+% of your memory for cache. Note that this mainly impacts large datasets, so most queries are unaffected by this. Also, setting this to high values usually doesn’t hurt performance.

shared_buffers:

This controls how much memory is shared between PostgreSQL processes. Unlike what you might think, setting this to a high value degrades performance. Read the documentation on how this works, increasing variables like max_connections will require you to increase shared_buffers. Anywhere between 64MB and 256MB should be plenty. Also, keeping this as low as possible may improve checkpoint performance, as higher values will cause PostgreSQL to keep more data in memory before writing to disk, so it creates more overhead (note, this is no longer the case in 8.3). Also note that you will need to increase the max shared memory allowed by you OS to increase shared_buffers. This is done by adding kernel.shmmax = <number of kilobytes> into /etc/sysctl.conf, then running `sudo /sbin/sysctl -p /etc/sysctl.conf`.

Managing Row Counts Using Triggers

Sunday, March 9th, 2008

It seems like this is a subject that I keep coming back to. Every situation is unique, and demands it’s own solution; so far, there’s no one-size-fits all solution for working around COUNT(*) in PostgreSQL. It some situations, there is no need for a workaround, and the performance hit of COUNT(*) is neglectable (this is generally true if you have under 2000 rows, but it mainly depends on the server).

This time, the problem I ran into is how to maintain the row count for how many friends a person has. Let’s say we have 2000 users, and each user has 30 friends. That would be 60,000 rows, and out of the range of what COUNT(*) can handle. The solution here is to create a count index table, that tracks how many friends each user has. The friends table looks something like this:

CREATE TABLE friends (
    your_id BIGINT NOT NULL,
    friend_id BIGINT NOT NULL,

    CONSTRAINT friends_chk_your_id FOREIGN KEY
        (your_id) REFERENCES members (id) DEFERRABLE,
    CONSTRAINT friends_chk_friend_id FOREIGN KEY
        (friend_id) REFERENCES members (id) DEFERRABLE
);

CREATE INDEX friends_your_id_idx ON friends (your_id);
CREATE INDEX friends_friend_id_idx ON friends (friend_id);
CREATE UNIQUE INDEX friends_unq_id_idx ON friends (your_id, friend_id);

Nothing special, just your member id, and your friends member id, with a unique index to make sure there’s no duplicate entries, and some foreign keys to make sure the user id’s are valid. I want to keep track of how many friends a user has, so that would be a simple two column table:

CREATE TABLE friends_count_idx (
    num_rows INTEGER NOT NULL,
    your_id BIGINT NOT NULL,

    PRIMARY KEY(your_id),
    CONSTRAINT friends_count_idx_chkcount CHECK (num_rows >= 0)
);

CREATE INDEX friends_count_idx_num_rows ON friends_count_idx (num_rows);

Pretty simple, right? A primary key for your_id, and a check constraint to make sure you don’t go under 0 rows. Now here’s where the fun begins.

I want a function to initialize the friends_count_idx table, as well as clean it up if something ever happens and it starts to drift. So basically, we’re looking at a maintenance function that deletes any existing records, as well as inserts records based on what’s in friends:

CREATE OR REPLACE FUNCTION friends_count_idx_materialize ()
RETURNS VARCHAR AS $$
    DECLARE
        rows INTEGER;
    BEGIN
        -- Delete old records
        DELETE FROM friends_count_idx;

        INSERT INTO friends_count_idx (num_rows, your_id)
        SELECT COUNT(*), your_id FROM friends
            GROUP BY your_id;

        GET DIAGNOSTICS rows = ROW_COUNT;

        RETURN rows || ' Rows Processed';
    END;
$$ LANGUAGE plpgsql;

SELECT friends_count_idx_materialize();
--------------------
53 Rows Processed
(1 row)

SELECT * FROM friends_count_idx LIMIT 1;
num_rows | your_id
----------+---------
23 |      14
(1 row)

Now we’re getting somewhere… But I also need a set of triggers that handle inserts and deletes. They also need to be able to auto-create a record in friends_count_idx for new users, who are adding their first friend. Still, nothing too hard, but time consuming to write and debug:

CREATE OR REPLACE FUNCTION friends_count_idx_insert ()
RETURNS TRIGGER AS $$
    DECLARE
        r RECORD;
    BEGIN
        SELECT INTO r num_rows FROM friends_count_idx
            WHERE your_id = NEW.your_id;

        -- Auto Create Record
        IF NOT FOUND THEN
            INSERT INTO friends_count_idx (num_rows, your_id)
            SELECT COUNT(*), your_id FROM friends
            WHERE your_id = NEW.your_id GROUP BY your_id;

            RETURN NEW;
        END IF;

        -- Update Existing Record
        UPDATE friends_count_idx SET num_rows = (r.num_rows + 1)
            WHERE your_id = NEW.your_id;

        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION friends_count_idx_delete ()
RETURNS TRIGGER AS $$
    DECLARE
        r RECORD;
        rows INTEGER;
    BEGIN
        SELECT INTO r num_rows FROM friends_count_idx
            WHERE your_id = OLD.your_id;

        -- Auto Create Record
        IF NOT FOUND THEN
            INSERT INTO friends_count_idx (num_rows, your_id)
            SELECT COUNT(*), your_id FROM friends
            WHERE your_id = OLD.your_id GROUP BY your_id;

            RETURN OLD;
        END IF;

        rows = r.num_rows - 1;

        -- Check for row count drift
        IF rows <= 0 THEN
            RETURN OLD;
        END IF;

        -- Update Existing Record
        UPDATE friends_count_idx SET num_rows = (r.num_rows - 1)
        WHERE your_id = OLD.your_id;

        RETURN OLD;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER friends_count_idx_insert AFTER INSERT ON friends
FOR EACH ROW EXECUTE PROCEDURE friends_count_idx_insert();

CREATE TRIGGER friends_count_idx_delete AFTER DELETE ON friends
FOR EACH ROW EXECUTE PROCEDURE friends_count_idx_delete();

SELECT * FROM friends_count_idx WHERE your_id = 3;

num_rows | your_id
----------+---------
3 |       3
(1 row)

INSERT INTO friends (your_id, friend_id)  VALUES (3, 7);

SELECT * FROM friends_count_idx WHERE your_id = 3;

num_rows | your_id
----------+---------
4 |       3
(1 row)
DELETE FROM friends WHERE your_id = 3 AND friend_id = 7;

SELECT * FROM friends_count_idx WHERE your_id = 3;

num_rows | your_id
----------+---------
3 |       3
(1 row)

So, it’s running smoothly. It could still probably use some more debugging, just to make sure every possible situation is covered, but the skeleton is working fine.

Neat Trick: Stats For Open PostgreSQL Connections

Sunday, March 9th, 2008

Useful for debugging sites that are hogging connections, especially if you have persistent connections on.

SELECT
    datname,
    COUNT(*) AS open_connections,
    MAX(backend_start) AS oldest_connection,
    MIN(backend_start) AS newest_connection
    FROM pg_stat_activity GROUP BY datname
UNION SELECT
    'Summary',
    COUNT(*),
    MAX(backend_start),
    MIN(backend_start)
    FROM pg_stat_activity;

Sorry, I accidentally deleted this post. Here it is again.

PostgreSQL 8.1 ORDER BY bug

Thursday, March 6th, 2008

There seems to be a bug in the way PostgreSQL 8.1 handles ORDER BY when you format a timestamp. An example is:

SELECT to_char(date, ‘MM-DD-YYYY’) AS date FROM table ORDER BY date ASC;

In this case, you can get bad results, as 1-1-2009 will be placed before 3-3-2008. This doesn’t appear to affect versions after 8.1. The workaround is to specify the table name before the date, ie:

SELECT to_char(date, ‘MM-DD-YYYY’) AS date FROM table ORDER BY table.date ASC;

UPDATE: It appears that PostgreSQL 8.1 actually handles it correctly, and 8.2 handles it improperly. Also note that ORDER BY works the same in 8.3 as it does in 8.1.

PostgreSQL Tips Of The Day

Sunday, March 2nd, 2008

1. You can do UPDATE queries, using syntax similar to INSERT:

UPDATE table SET (column1, column2) = (’foo’, ‘bar’);

This can make your code much easier to manage when creating queries in PHP:

$vals = array(’column1′ => $db->quote(’foo’), ‘column2′ => $db->quote(’bar’));

$db->query(sprintf(’UPDATE table SET (%s) = (%s) WHERE id = 1′),

join(’, ‘, array_keys($vals)), join(’, ‘, $vals));

2. Timestamps, intervals, and dates can easily be formatted using the to_char function:

SELECT to_char(NOW(), ‘MM-DD-YYYY’);

to_char
————
03-02-2008
(1 row)

3. You can add and subtract with date/time data types:

SELECT NOW() - ‘10 minutes’::interval;

?column?
——————————-
2008-03-02 10:20:32.849205-08
(1 row)

SELECT current_date - ‘1 week’::interval;

?column?
———————
2008-02-24 00:00:00
(1 row)

4. The age function comes in very handy. It’s used to calculate the age between intervals, timestamps, and dates:

SELECT age(NOW(), NOW() - ‘10 minutes’::interval);

age
———-
00:10:00
(1 row)

5. You can achieve IF/ELSE-like functionality using CASE WHEN/ELSE/END:

SELECT CASE WHEN age(NOW(), NOW() - ‘17 years’::interval) >= 18 THEN ‘you are older than 18′ ELSE ‘you are younger than 18′ END;

case
————————-
you are younger than 18
(1 row)

PHP And pg_connect: Post Mortem

Sunday, February 24th, 2008

In my previous post, I tried to find out why a PHP script was so slow. Here’s an update from then.

Basically, pg_connect is slow. There’s probably a lot of PHP overhead to connect to a database plus it probably expects to be opening a fresh connection everytime, as using pooled connections doesn’t seem to speed it up very much (it goes from ~50 requests a second to ~200). Persistent connections seem to be much faster, albeit much slower than connecting to a pool with Python (450 requests a second compared to 1000), however, if you understand the inner workings of persistent connections you will know that using them with PostgreSQL is seriously broken, as a small number of users connecting to different sites on the same server can easily exhaust max_connections. Using pg_pconnect to connect to a connection pool doesn’t work either; it’s like combining the worst of both: the slowness of pg_connect, and the broken-ness of pg_pconnect.

So there’s pretty much no solution to this problem at this point in time, except maybe to port the script to Python.

humongous professional douchebag

PHP Is Slow… Really Slow

Friday, February 22nd, 2008

Lately, I’ve been trying to figure out what’s slowing down a site that’s currently in development… It didn’t take long to figure out what it was.

There’s a script on the server that handles image requests transparently, ie, utilizing mod_rewrite. Basically what it does is this:

  1. Connect to database.
  2. Query the database for the meta info of the requested file (created, modified, filetype, etc).
  3. Check if the client has a cached copy of the image, and if it does, send a 304: Not Modified, an ETag, and return.
  4. If not, grab the file from server-side cache if it exists, send it, and return.
  5. Else, fetch the image from the database, cache it locally on disk, and send the data.

99% of the time, all it does it grab the meta data, and send a 304 Not Modified, yet, it is extremely slow, so I decided to do a quick benchmark of PHP and Python, comparing how fast they deliver a small image (about 10kb) from a PostgreSQL database… The results were shocking.

The setup is very modest, an Athlon 64 3200+, 2GB ram, and a 500GB 7200.10 SATA II hard drive. The test is very simple: connect to the database, of which the connections are pooled by pgpool, fetch and send the image, then close the database connection.

The code used is as followed.

PHP:

// Note it's port 5433. That's the port that pgpool listens on.

$con = pg_connect("dbname=testdb user=testuser host=127.0.0.1"
    . " password=password port=5433");

$res = pg_query("SELECT data, filetype FROM images WHERE id = 1");
$res = pg_fetch_assoc($res);

header('Content-Type: ' . $res['filetype']);
echo pg_unescape_bytea($res['data']);

pg_close($con);

And now Python:

import psycopg2
from mod_python import apache

def handler(req):
    con = psycopg2.connect("dbname='testdb' user='testuser'"
        + " host='127.0.0.1' password='password' port='5433'")
    cur = con.cursor()

    cur.execute("SELECT data, filetype FROM images WHERE id = 1")
    row = cur.fetchone()
    con.close()

    req.content_type = row[1]
    req.write(row[0])
    return apache.OK

Pretty simple, right? I ran the test using ab, or Apache Benchmark, with the command:

ab -c 10 -t 15 http://localhost/<scriptname>

I ran the test three times for each script, here’s the results:

PHP - Maximum Requests/Second: 145.32
Python - Maximum Requests/Second: 1088.64
Apache2 Static Image - Max Requests/Second - 2778.54

WOW! Python is nearly 8 times faster than PHP in this test! I was so intrigued, I decided to dig deeper to see exactly what the bottleneck is in PHP.

I created a new test: Connect to the database, and close the connection immediately, don’t send any data whatsoever.

Here’s the results from that:

PHP - Max Requests/Second: 205.00
Python - Max Requests/Second: 1089.32

So PHP gains ~50 requests/second, and Python has no improvement at all. But what if we just leave the scripts blank? Basically just return and do nothing at all? Here’s the results from that:

PHP - Max Requests/Second: 1566.77
Python - Max Requests/Second: 1819.46

So basically, PHP sucks, at least when it comes to PostgreSQL connections. What’s strange is PHP doesn’t seem to have the same problem with MySQL connections, where by simply connecting to the database, performance drops significantly, so something is certainly wrong with the way PHP handles PostgreSQL. I bet there’s a lot of “MySQL vs. PostgreSQL” benchmarks out there that wield wildly inaccurate results because of this. Actually, if you compare MySQL and PostgreSQL purely in Python, you’ll find that PostgreSQL is quite a bit faster at handling binary data, such as images. I will be back if I find more information or a fix for this behavior.

UPDATE: it looks like the problem lies specifically with the pg_connect function. This problem doesn’t exists with pg_pconnect, however, you cannot connect to a database pool like pgpool/pgbounder with pg_pconnect, as that degrades performance.