Archive for the ‘SQL’ Category

PHP And pg_pconnect - Persistent Connections Suck

Thursday, January 10th, 2008

Lately, a server I’ve been maintaining has been running out of database connections. Upon further investigation, the culprit is clearly PHP’s pg_pconnect function. Persistent connections with PHP is buggy to say the least, and it’s often mistaken for connection pooling. Connection pooling is different from the persistent connection implementation that PHP uses, since persistent connections are only per Apache process. That means that each Apache process (likely per user), has it’s own pool of connections that only it can use. What’s worse is the connection isn’t closed until the Apache process times out, so most of the open connections are idle. The flaw in this design is astonishing. Right now, there’s 38 open database connections and only two of them are actively being used. Also, by default, PHP has pgsql.max_persistent set to -1, meaning each Apache process is allowed unlimited database connections. If you’re set on using persistent connections over pooling, you should set that to no more than 1 or 2.

The real solution is to use a “middleman” process to provide true connection pooling. This can easily be done easily with pgpool.

This is the pgpool configuration I’m using right now:

listen_addresses = ‘localhost’
port = 5433
socket_dir = ‘/tmp’
backend_host_name = ‘localhost’
backend_port = 5432
backend_socket_dir = ‘/var/run/postgresql’

Now, the only change you have to make to your PHP projects, is to use pg_connect instead of pg_pconnect, and connect to port 5433, instead of 5432. That’s it! Pgpool will automatically pool your database connections, and you will avoid running out of connections!

hmnos ocea
uogu rfsinlduhbg

PostgreSQL Slow Count() Workaround

Wednesday, December 12th, 2007

You may also be interested in this. It’s a far more elegant solution, if you only need the entire number of rows in a table.


MySQL zealots tend to point to PostgreSQL’s slow count() as a weakness, however, in the real world, count() isn’t used very often, and if it’s really needed, most good database systems provide a framework for you to build a workaround. The reason MySQL has fast counts, is because, they’re cached. This works in MySQL because MySQL uses serialized INSERTs, but higher grade DBs are transactional, and PostgreSQL uses MVCC, so caching row counts produces inaccurate results.

Here’s an example count() workaround for PostgreSQL, using a combination of PL/PgSQL and TRIGGERs. You could certainly extend this model, adding more columns to the cached rows table, so you can simulate SELECT count(id) WHERE foo = ‘bar’, although you’d have to add a TRIGGER for UPDATEs.


--- This table will be used as a dummy table,
--- that we wan't to get a count() from.
CREATE TABLE tbl1 (
    id INTEGER
);

--- This table is where the cached counts will be stored
CREATE TABLE tbl1_num_rows (
    num_rows INTEGER NOT NULL
);

--- Function to update the count in tbl1_num_rows,
--- on an INSERT
CREATE OR REPLACE FUNCTION tbl1_up_rows_insert ()
RETURNS TRIGGER AS $$
    DECLARE
        row RECORD;
    BEGIN
        --- Auto-add record
        SELECT INTO row num_rows FROM tbl1_num_rows;
        IF NOT FOUND THEN
            INSERT INTO tbl1_num_rows VALUES (0);
        END IF;

        UPDATE tbl1_num_rows SET num_rows =
		((SELECT num_rows FROM tbl1_num_rows) + 1);

        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

--- Function to update the count in tbl1_num_rows,
--- on a DELETE
CREATE OR REPLACE FUNCTION tbl1_up_rows_delete ()
RETURNS TRIGGER AS $$
    DECLARE
        row RECORD;
    BEGIN
        --- Auto-add record
        SELECT INTO row num_rows FROM tbl1_num_rows;
        IF NOT FOUND THEN
            INSERT INTO tbl1_num_rows VALUES (0);
        ELSE
            IF row.num_rows > 0 THEN
                UPDATE tbl1_num_rows SET num_rows =
                    ((SELECT num_rows FROM tbl1_num_rows) - 1);
            END IF;
        END IF;

        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

--- Add the TRIGGERS to tbl1
CREATE TRIGGER tbl1_up_rows_on_insert AFTER INSERT ON tbl1 FOR EACH ROW
EXECUTE PROCEDURE tbl1_up_rows_insert();

CREATE TRIGGER tbl1_up_rows_on_delete AFTER DELETE ON tbl1 FOR EACH ROW
EXECUTE PROCEDURE tbl1_up_rows_delete();

Why The Statement “MySQL Is Faster Than PostgreSQL” Is Wrong

Friday, December 7th, 2007

I’ve come across a lot of hardcore MySQL fanboys who say they use MySQL because it’s faster than PostgreSQL. On a certain level, this statement is wrong. While I agree that, generally speaking, MySQL is faster than PostgreSQL, you have to understand that developers like that don’t take load into account. Most of the benchmarks they cite aren’t real world benchmarks; they test how fast data is delivered from the database, but don’t test how well the database performs under high load. In that respect, PostgreSQL beats MySQL hands down on multi-core, multi-processor, real-world servers, while MySQL crumbles under the load. This is due to the way PostgreSQL is built; PostgreSQL is an excellent example of good multi-threaded programming, and it scales almost linearly up to about 16 cores (as of version 8.0). And honestly, who cares if MySQL can deliver a simple query 0.01ms faster than PostgreSQL under no load? Promoting your database of choice with that argument is childish.

But that’s not the reason I like PostgreSQL. I like it because it offers a wealth of features that make my life a whole lot easier, it allows for better data integrity than MySQL, and it does all that with far less “gotcha’s” than MySQL.

Preserving Database Compatibiliy

Friday, November 30th, 2007

There’s always a lot of talk about preserving database compatibility, creating database abstraction layers, providing syntax compatibility, etc. At some point, you have to ask yourself, is it worth the effort? In most cases, it’s not. Although I highly recommend database abstraction layers, like PEAR::MDB2, unless you really need to have a project run on more than one database system, trying to keep compatibility is just silly. The reason is, MySQL, PostgreSQL, and Oracle all differ in several major ways. For instance, PostgreSQL and Oracle use Sequences, and MySQL uses it’s own auto_increment for primary keys, and the name of datatypes in Oracle differ greatly from the names of datatypes in MySQL and PostgreSQL. Also, each of those databases differ in behavior, such as MySQL having loose standards, and PostgreSQL and Oracle having strict standards, which can be seen with INSERTs on MySQL, where it will accept data that doesn’t meet the restrictions put on the column.

Keep in mind that in order to keep compatibility, you have to sacrifice speed. Now, for most small projects, sacrificing speed isn’t that big of an issue, but for large high traffic websites, sacrificing speed costs a lot, in server requirements, scalability, and user experience.

If you’re thinking of making a cross-database project, ask yourself these questions:

  • Are you expecting your project to be used on a wide variety of servers?
  • Are you expecting a large base of people who rely on your project?
  • Will making your project only compatible with a single database hurt it?
  • Will your project be publicly available, or sold to several different clients?

If you answer yes to one of these, than making your project cross-database compatible may be a good idea. Also, if you’re making something like a forum or blog package, than cross-database compatibility is probably a good idea.

PostgreSQL And Timestamps Inside Transactions

Sunday, November 18th, 2007

Calling now(), or current_timestamp within a transaction always returns the same value (the time the transaction started I think). Here’s a workaround that I’m using:

BEGIN;

SELECT timeofday()::timestamptz;

COMMIT;

Working Around ORDER BY RANDOM()

Monday, November 5th, 2007

Here’s what I’m doing to work around ORDER BY RANDOM() in PostgreSQL, which is generally slow:

  1. Get the maximum value for the primary key, eg, “SELECT MAX(id) FROM tbl”
  2. Get a random number within that range. In PHP, you would do “$rand_id = rand(1, $max_id);”
  3. Use the random number to fetch data, eg, “SELECT * FROM tbl WHERE id >= $max_id LIMIT 1″

I use “>=” as opposed to “=” to ensure that, if there’s any holes in the ID’s, it wont return an empty dataset.

Of course, this is only useful when obtaining a single record, as the starting position is the only thing that’s random. If you need any more records, you’d have to get a new random number and perform another query.

A quick and dirty benchmark shows a pretty good performance increase. On a 600 row table with PostgreSQL 8.2, ORDER BY RANDOM takes ~2.1ms, where the other method takes between 0.200ms and 0.300ms.

Optionally, you could replace “SELECT MAX(id) FROM tbl” with “SELECT last_value FROM tbl_seq” for a performance increase, assuming you know the name of the sequence for the table you’re working on.

PostgreSQL And Count()

Wednesday, October 31st, 2007

One of the most irritating things about PostgreSQL is it’s inability to perform count() in a timely manner. It takes forever because it’s MVCC (Multi Version Concurrency) engine, which makes it so it has to iterate through all the rows to get an accurate count.

In the past, you could work around it by scheduling a VACUUM every hour and using this:

SELECT reltuples AS count FROM pg_class WHERE relname = ‘table name’ AND relkind = ‘r’;

But now there seems to be quite a bit of discussion about rewriting aggregates like count() in PostgreSQL. In 8.3 Beta, count() has been cleaned up a lot so it’s much faster and performs better under high load. There’s also been discussion about caching count()’s, which may prove difficult because of the Multi-Version nature of PostgreSQL.