Archive for the ‘Code’ Category

Python: Tips For Writing Daemons

Saturday, May 3rd, 2008

Here’s some useful information for writing daemons in Python.

One common problem that people run into is os.fork() producing zombie processes when children quit. This can easily be overcome by setting the SIGCHLD signal to SIG_IGN. ie:

import signal

signal.signal(signal.SIGCHLD, signal.SIG_IGN)

On some flavors of Unix, you are forced to do a double-fork on startup, in order to go into daemon mode. This is because single forking isn’t guaranteed to detach from the controlling terminal. The solution is this:

import os, sys, time

# Main loop
def main():
    # Drop privs.
    os.setgid(1000) # Replace with desired GID
    os.setuid(1000) # Replace with desired UID

    time.sleep(10)
    sys.exit(0)

if (not os.fork()):
    os.setsid() # Become session leader
    pid = os.fork()

    if (pid):
        # Parent, write PID file
        fp = open('/var/run/my-daemon.pid', 'w')
        fp.write(str(pid))
        fp.flush()

        # Forcibly sync disk
        os.fsync(fp.fileno())
        fp.close()

        os._exit(0)
    else:
        # Child, call main
        main()
else:
    # Parent
    os._exit(0)

PHP And pg_connect: Post Mortem

Sunday, February 24th, 2008

In my previous post, I tried to find out why a PHP script was so slow. Here’s an update from then.

Basically, pg_connect is slow. There’s probably a lot of PHP overhead to connect to a database plus it probably expects to be opening a fresh connection everytime, as using pooled connections doesn’t seem to speed it up very much (it goes from ~50 requests a second to ~200). Persistent connections seem to be much faster, albeit much slower than connecting to a pool with Python (450 requests a second compared to 1000), however, if you understand the inner workings of persistent connections you will know that using them with PostgreSQL is seriously broken, as a small number of users connecting to different sites on the same server can easily exhaust max_connections. Using pg_pconnect to connect to a connection pool doesn’t work either; it’s like combining the worst of both: the slowness of pg_connect, and the broken-ness of pg_pconnect.

So there’s pretty much no solution to this problem at this point in time, except maybe to port the script to Python.

humongous professional douchebag

PHP Is Slow… Really Slow

Friday, February 22nd, 2008

Lately, I’ve been trying to figure out what’s slowing down a site that’s currently in development… It didn’t take long to figure out what it was.

There’s a script on the server that handles image requests transparently, ie, utilizing mod_rewrite. Basically what it does is this:

  1. Connect to database.
  2. Query the database for the meta info of the requested file (created, modified, filetype, etc).
  3. Check if the client has a cached copy of the image, and if it does, send a 304: Not Modified, an ETag, and return.
  4. If not, grab the file from server-side cache if it exists, send it, and return.
  5. Else, fetch the image from the database, cache it locally on disk, and send the data.

99% of the time, all it does it grab the meta data, and send a 304 Not Modified, yet, it is extremely slow, so I decided to do a quick benchmark of PHP and Python, comparing how fast they deliver a small image (about 10kb) from a PostgreSQL database… The results were shocking.

The setup is very modest, an Athlon 64 3200+, 2GB ram, and a 500GB 7200.10 SATA II hard drive. The test is very simple: connect to the database, of which the connections are pooled by pgpool, fetch and send the image, then close the database connection.

The code used is as followed.

PHP:

// Note it's port 5433. That's the port that pgpool listens on.

$con = pg_connect("dbname=testdb user=testuser host=127.0.0.1"
    . " password=password port=5433");

$res = pg_query("SELECT data, filetype FROM images WHERE id = 1");
$res = pg_fetch_assoc($res);

header('Content-Type: ' . $res['filetype']);
echo pg_unescape_bytea($res['data']);

pg_close($con);

And now Python:

import psycopg2
from mod_python import apache

def handler(req):
    con = psycopg2.connect("dbname='testdb' user='testuser'"
        + " host='127.0.0.1' password='password' port='5433'")
    cur = con.cursor()

    cur.execute("SELECT data, filetype FROM images WHERE id = 1")
    row = cur.fetchone()
    con.close()

    req.content_type = row[1]
    req.write(row[0])
    return apache.OK

Pretty simple, right? I ran the test using ab, or Apache Benchmark, with the command:

ab -c 10 -t 15 http://localhost/<scriptname>

I ran the test three times for each script, here’s the results:

PHP - Maximum Requests/Second: 145.32
Python - Maximum Requests/Second: 1088.64
Apache2 Static Image - Max Requests/Second - 2778.54

WOW! Python is nearly 8 times faster than PHP in this test! I was so intrigued, I decided to dig deeper to see exactly what the bottleneck is in PHP.

I created a new test: Connect to the database, and close the connection immediately, don’t send any data whatsoever.

Here’s the results from that:

PHP - Max Requests/Second: 205.00
Python - Max Requests/Second: 1089.32

So PHP gains ~50 requests/second, and Python has no improvement at all. But what if we just leave the scripts blank? Basically just return and do nothing at all? Here’s the results from that:

PHP - Max Requests/Second: 1566.77
Python - Max Requests/Second: 1819.46

So basically, PHP sucks, at least when it comes to PostgreSQL connections. What’s strange is PHP doesn’t seem to have the same problem with MySQL connections, where by simply connecting to the database, performance drops significantly, so something is certainly wrong with the way PHP handles PostgreSQL. I bet there’s a lot of “MySQL vs. PostgreSQL” benchmarks out there that wield wildly inaccurate results because of this. Actually, if you compare MySQL and PostgreSQL purely in Python, you’ll find that PostgreSQL is quite a bit faster at handling binary data, such as images. I will be back if I find more information or a fix for this behavior.

UPDATE: it looks like the problem lies specifically with the pg_connect function. This problem doesn’t exists with pg_pconnect, however, you cannot connect to a database pool like pgpool/pgbounder with pg_pconnect, as that degrades performance.

PostgreSQL And Timestamps Inside Transactions

Sunday, November 18th, 2007

Calling now(), or current_timestamp within a transaction always returns the same value (the time the transaction started I think). Here’s a workaround that I’m using:

BEGIN;

SELECT timeofday()::timestamptz;

COMMIT;

Working Around ORDER BY RANDOM()

Monday, November 5th, 2007

Here’s what I’m doing to work around ORDER BY RANDOM() in PostgreSQL, which is generally slow:

  1. Get the maximum value for the primary key, eg, “SELECT MAX(id) FROM tbl”
  2. Get a random number within that range. In PHP, you would do “$rand_id = rand(1, $max_id);”
  3. Use the random number to fetch data, eg, “SELECT * FROM tbl WHERE id >= $max_id LIMIT 1″

I use “>=” as opposed to “=” to ensure that, if there’s any holes in the ID’s, it wont return an empty dataset.

Of course, this is only useful when obtaining a single record, as the starting position is the only thing that’s random. If you need any more records, you’d have to get a new random number and perform another query.

A quick and dirty benchmark shows a pretty good performance increase. On a 600 row table with PostgreSQL 8.2, ORDER BY RANDOM takes ~2.1ms, where the other method takes between 0.200ms and 0.300ms.

Optionally, you could replace “SELECT MAX(id) FROM tbl” with “SELECT last_value FROM tbl_seq” for a performance increase, assuming you know the name of the sequence for the table you’re working on.

PostgreSQL And Count()

Wednesday, October 31st, 2007

One of the most irritating things about PostgreSQL is it’s inability to perform count() in a timely manner. It takes forever because it’s MVCC (Multi Version Concurrency) engine, which makes it so it has to iterate through all the rows to get an accurate count.

In the past, you could work around it by scheduling a VACUUM every hour and using this:

SELECT reltuples AS count FROM pg_class WHERE relname = ‘table name’ AND relkind = ‘r’;

But now there seems to be quite a bit of discussion about rewriting aggregates like count() in PostgreSQL. In 8.3 Beta, count() has been cleaned up a lot so it’s much faster and performs better under high load. There’s also been discussion about caching count()’s, which may prove difficult because of the Multi-Version nature of PostgreSQL.

MySQL Anguish

Saturday, October 6th, 2007

For most projects, I use PostgreSQL. It’s great, it scales extremely well across multi-core multi-cpu machines, it’s loaded with features, and it handles everything you can throw at it with grace. I also try to maintain respect for MySQL, and the last thing I want to start is a flame war, but dammit, every time I try doing something with MySQL, I run across bugs, things that weren’t thought out, and very unelegant solutions.

Just for instance, today I was laying out some tables for a MySQL project I’m working on in my spare time, and I came across one of the most retarded things ever: you can’t create a field with “now()” as the default value, unless the field is an actual timestamp (ie., doesn’t work for date, time, etc). Now that wouldn’t be so bad if you could use current_date() for the default value, but apparently MySQL won’t let me do that either.

Here’s the result on MySQL 5.0:

mysql> CREATE TABLE test (d DATE NOT NULL DEFAULT CURRENT_DATE());

ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘CURRENT_DATE())’ at line 1

Update: Comments are disabled due to spamming.

This is why I love PostgreSQL

Saturday, August 18th, 2007

I’ve always loved PostgreSQL ever since I started really playing around with it’s potential. Now I always find myself stumbling upon new tricks and extremely elegant solutions to problems, that would be a pain in the ass to handle on the application side. Here’s a cool little feature I found while working on some logging stuff.

SELECT * FROM stats WHERE age(current_timestamp, created) <= ‘30 minutes’;

Very useful in some situations. I still haven’t found a MySQL equivalent.