Archive for the ‘PostgreSQL’ Category

PostgreSQL Planing Built In Replication

Wednesday, June 18th, 2008

Yes, that’s right, the PostgreSQL team is planning to add built in replication. This has come, in part, do to the demand of replication solutions, and the lack of easy-to-setup, easy-to-maintain replication for PostgreSQL. Many projects have moved to MySQL, simply because developers don’t want to deal with the headaches of Slony-I or PGCluster.

The features are planned to ship with PostgreSQL 8.4, however, do to technical issues, read-only queries to slaves may not be ready until 8.5.

This is going to be a huge leap for PostgreSQL.

The full announcement is on the PostgreSQL mailing list.

Regular Expressions In SQL

Wednesday, June 11th, 2008

Regular expressions are pretty easy to use, however, different DBMS’s handle them differently.

In PostgreSQL, regular expression searches are done via the “~” character. For instance:

SELECT * FROM table1 WHERE column1 ~ '^(.?)unq([a-z]{4})([1-9]+)$';

In MySQL, this is done via REGEXP:

SELECT * FROM table1 WHERE column1 REGEXP '^S([c-z])n(.*)$';

In Oracle, this is done via the REGEXP_LIKE function:

SELECT * FROM table1 WHERE REGEXP_LIKE(column1, '[[:digit:]]{2}$');

PostgreSQL: Ultra Simple COUNT(*) Workaround

Monday, April 21st, 2008

Just a little something I whipped up. It’s only for getting the total number of rows in table X. If you have under 10,000 rows, the performance hit of COUNT(*) is neglectable and you probably don’t need this.

Example:

SELECT __count_add('my_table');
SELECT rows FROM __count_tables WHERE tablename = 'my_table';

Ultra Simple COUNT(*) Workaround

Indexing Functions With PostgreSQL

Tuesday, April 15th, 2008

In PostgreSQL, it’s possible to create expressional indexes. In other words, indexes such as:

CREATE INDEX varchar_column_idx ON bar (UPPER(varchar_column));

It is possible to do this on any function that is marked as IMMUTABLE. IMMUTABLE just means that the function will always return the same result when given the same input. This is very powerful, as you can create indexes on user created functions, which can dramatically improve performance.

For instance, lets say I need to group some data by week, with every week starting on Sunday. I could do:


CREATE OR REPLACE FUNCTION find_previous_sunday
(TIMESTAMP) RETURNS DATE AS $$
    BEGIN
        RETURN
        ($1 + (-extract(dow from $1) || ' days')::interval)::date;
    END;
$$ LANGUAGE plpgsql IMMUTABLE;

Note, that you can always do extract(week from timestamp), but there isn’t much flexibility there, and clients are often picky about when they want their weeks to start, plus this is only an example. Onwards to the benchmark:


---
--- Test without indexing
---
EXPLAIN ANALYZE SELECT
    find_previous_sunday(entry_time) AS week_start,
    SUM(distance)
    FROM tbl
    WHERE user_id = 1
    GROUP BY week_start
    ORDER BY week_start DESC limit 1;

    QUERY PLAN
-------------------------------
 Limit  (... loops=1)
   ->  GroupAggregate  (...)(actual time=1.561..1.561...)
         ->  Sort  (...) (actual time=1.556..1.556 rows=2...)
               Sort Key: find_previous_sunday(entry_time)
               ->  Index Scan using tbl_user_id on tbl
                        (...) (actual time=0.060..1.388 rows=128...)
                     Index Cond: (user_id = 1)
 Total runtime: 1.651 ms
(7 rows)

---
--- Index find_previous_sunday function, and run test again.
---
CREATE INDEX tbl_week_start ON tbl
    (find_previous_sunday(entry_time));

VACUUM FULL ANALYZE;

EXPLAIN ANALYZE SELECT
    find_previous_sunday(entry_time) AS week_start,
    SUM(distance)
    FROM tbl
    WHERE user_id = 1
    GROUP BY week_start
    ORDER BY week_start DESC limit 1;

       QUERY PLAN
-------------------------------
 Limit  (...) (actual time=0.062..0.062 rows=1 loops=1)
   ->  GroupAggregate  (...) (actual time=0.062..0.062...)
         ->  Index Scan Backward using tbl_week_start on tbl
                  (...) (actual time=0.042..0.057 rows=2...)
               Filter: (user_id = 1)
 Total runtime: 0.096 ms
(5 rows)

1.651ms without indexing, 0.096ms with. Note that there’s only about 200 rows in the test table, and performance without indexing scales very poorly as the number of rows increases.

One Of PostgreSQL’s Biggest Weaknesses

Sunday, April 13th, 2008

One of PostgreSQL’s biggest weaknesses is easy replication and clustering. Note, that I said easy. There’s absolutely replication solutions for PostgreSQL, but many of them have serious downfalls, and are a serious pain in the ass to setup.

Slony-I is a decent single-master multi-slave solution, if you don’t need LOBs replicated. It’s a little hard to setup, and you have to learn it’s special syntax for the config files you will have to create. A serious pitfall is it is ascynchronous, and therefore doesn’t guarantee all nodes will have the same information at any given time. Writes don’t scale very well. Also, if the master server goes down, you’re pretty much screwed unless you have a really cleaver way of failing over, since:

  • Sequence replication isn’t always reliable.
  • Triggers are disabled on all nodes, except for the master, to make sure they’re not called twice.
  • Incomplete failover results in errors being thrown on all nodes, including the failover master, with writes.

When using Slony-I, you need something like pgpool-II so you can transparently load balance nodes and redirect writes to the master server.

pgpool-II also has a multi-master replication mode, however, it absolutely sucks. The reason? It’s statement based, so instead of doing an insert and copying the result to all other nodes, it runs the same statement on all nodes, resulting in drift on all fields that use default values.

PgCluster seems to fix just about all of the problems with Slony-I and pgpool-II, by providing synchronous multi-master replication, however:

  • It’s basically a patchset to PostgreSQL, so it tends to run a release behind.
  • It’s a major, major, MAJOR, pain in the ass to setup.

Conditional UPDATE And DELETE In PostgreSQL

Friday, April 4th, 2008

One of the coolest PostgreSQL features: Conditional SELECTs/UPDATEs/DELETEs.

Lets say we have a table of distances, but we need to add a column for the distance type, ie, Miles or Kilometers. Conditional UPDATE to the rescue!!!:

UPDATE distances SET measurement =
    (CASE WHEN (name ~* '^[0-9]+K|km.$|kilometers$')
        THEN 'kilometers'
    WHEN (name ~* '^[0-9]+M|mi.$|miles$')
        THEN 'miles'
    ELSE 'unknown' END);

They also work for DELETEs, but I haven’t found any real world situation where it makes a lot of sense. Regardless:

DELETE FROM table WHERE
    (CASE WHEN col1 ~* '^foo|bar$'
        THEN 't'::boolean
    WHEN col2 ~* '^oof|rab$'
        THEN 't'::boolean
    ELSE 'f'::boolean END);

Constraints In PostgreSQL

Friday, March 28th, 2008

After a very long time, I decided it was time to rewrite a script I created so I could move databases from one version of PostgreSQL to another. For instance, I could move a PostgreSQL 8.3 database to any version on PostgreSQL after 7.3. The problem was, I never got around to implementing constraints in it. Hopefully someone can find this info useful.

Constraint information is held in the pg_constraint table. The column, conrelid, defines the relation id of the table the constraint is on, and it corresponds to the relfilenode column in pg_class. This column is used for the join.

Constraint type is held in the contype column of the pg_constraint table. Here’s a map of the different types:

  • f = Foreign Key
  • p = Primary Key
  • u = Unique
  • c = Check Constraint

The consrc column holds information about check constraints, so for instance, we can do:

SELECT
    relname,
    conname,
    consrc
    FROM pg_constraint, pg_class
    WHERE conrelid = relfilenode AND contype = 'c';

relname     |  conname                     | consrc
---------------------------------------------------------------------
ratings      | ratings_chk_rating    | ((rating >= 1) AND (rating <= 5))

(1 row)

Primary Keys and Unique constraints are a bit more complex, as you have to find the column name as well. You can do something like this (Note: this won’t work for multi-column primary/unique keys):

SELECT
    relname,
    attname,
    conname,
    CASE WHEN contype = 'p' THEN 'primary' ELSE 'unique' END AS type
    FROM pg_constraint
        INNER JOIN pg_class ON (conrelid = relfilenode)
        INNER JOIN pg_attribute ON (conrelid = attrelid AND conkey[1] = attnum)
    WHERE contype = 'u' OR contype = 'p';

Will return something like:

relname | attname | conname       | type
--------------------------------------------
stores    | id             | stores_pkey | primary

Foreign Keys are even more complex, as you have to join pg_attribute and pg_class twice:

SELECT
    c1.relname AS local_table,
    c2.relname AS foreign_table,
    conname,
    a1.attname AS local_key,
    a2.attname AS foreign_key,
    condeferrable
    FROM pg_constraint
        INNER JOIN pg_class c1 ON (conrelid = c1.relfilenode)
        INNER JOIN pg_class c2 ON (confrelid = c2.relfilenode)
        INNER JOIN pg_attribute a1 ON (conrelid = a1.attrelid AND conkey[1] = a1.attnum)
        INNER JOIN pg_attribute a2 ON (confrelid = a2.attrelid AND confkey[1] = a2.attnum)
    WHERE contype = 'f';

Bashing VACUUM? Now What’s Wrong With This Picture?

Thursday, March 20th, 2008

One thing I get sick of is MySQL users bashing PostgreSQL for having the VACUUM command. They basically say the usual: VACUUM is slow, it’s flawed design, why should I have to VACUUM, etc. This, despite the fact that MySQL has a command that does almost the exact same thing, the difference of course being that PostgreSQL also collects stats from the database to optimize how the planner functions.

VACUUM, of course, is slow and uses a lot of CPU/Disk, but seriously, what’s stopping you from setting up a cron script that does `echo “VACUUM ANALYZE” | nice -n 19 psql database_name`, every night at midnight? I have a ~300MB database, and it takes all of three seconds to VACUUM ANALYZE. The only problem would be if you have a humongous database, or a slow server.