Archive for the ‘SQL’ Category

PostgreSQL AGE() Function Not Indexable

Monday, February 9th, 2009

I was working a project that needed to be able to look up records, from a certain time frame. Interestingly enough, I discovered that the AGE() function cannot use indexes. My workaround, is to create a functional index that extracts the unix timestamp from the TIMESTAMP data type, and uses a BETWEEN clause.

So:

SELECT * FROM my_table
WHERE AGE(NOW(), my_timestamp) <= '1 day'::interval;

Becomes:

CREATE INDEX my_table_epoch_idx ON my_table
  (EXTRACT(epoch FROM my_timestamp));

SELECT * FROM my_table WHERE EXTRACT(epoch FROM my_timestamp)
  BETWEEN EXTRACT(epoch FROM NOW()) - 86400
  AND EXTRACT(epoch FROM NOW());

Both queries fetch items where my_timestamp is less than 24 hours old, but the latter has the ability to use an index.

PostgreSQL And SELECT DISTINCT On Large Tables

Thursday, January 22nd, 2009

I’ve been noticing lately that PostgreSQL’s SELECT DISTINCT operation seems to be un-optimized for large tables. Take a look:

# EXPLAIN ANALYZE SELECT DISTINCT mycol FROM mytable;
QUERY PLAN
------------------------
 Unique (time=3393.085..3615.420 rows=20799 loops=1)
 ->  Sort  (time=3393.082..3514.374 rows=204494 loops=1)
       Sort Key: mycol
       ->  Seq Scan (time=9.455..289.479 rows=204494)
 Total runtime: 3626.724 ms
(5 rows)

Now compare DISTINCT to GROUP BY:

# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol;
QUERY PLAN
------------------------
 HashAggregate (time=415.129..427.497 rows=20799)
   ->  Seq Scan (time=8.127..269.190 rows=204494)
 Total runtime: 433.836 ms
(3 rows)

That’s quite an improvement. 3626.724 ms compared to 433.836 ms. Even with ordering, GROUP BY is still faster:

# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol
ORDER BY mycol;
QUERY PLAN
------------------------
 Sort  (time=642.126..654.545 rows=20799 loops=1)
   Sort Key: mycol
   ->  HashAggregate  (time=408.922..423.040 rows=20799)
         ->  Seq Scan (time=7.519..263.577 rows=204494)
 Total runtime: 661.537 ms
(5 rows)

The main problem seems to be that PostgreSQL is choosing a bad query plan. Lets try forcing PostgreSQL to use an index scan on SELECT DISTINCT:

# SET enable_seqscan = off;
SET
# EXPLAIN ANALYZE SELECT DISTINCT mycol FROM mytable;
QUERY PLAN
------------------------
Unique (actual time=0.013..361.917 rows=20799 loops=1)
   ->  Index Scan (actual time=0.012..266.856 rows=204494)
 Total runtime: 368.296 ms
(3 rows)

# SET enable_seqscan = on;
SET

Wow! Big improvement. Lets see how that compares to GROUP BY with sequential scans turned off:

# SET enable_seqscan = off;
SET
# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol
ORDER BY mycol;
QUERY PLAN
------------------------
 Group (actual time=0.014..362.013 rows=20799)
   ->  Index Scan (actual time=0.011..267.001 rows=204494)
 Total runtime: 368.201 ms
(3 rows)

# SET enable_seqscan = on;
SET

So, at this point, the two queries are comparable. Tweaking the configuration sometimes causes PostgreSQL to choose a bad plan on other queries, so it seems like if PostgreSQL is choosing a bad plan for a query like this, you can simply turn sequential scans off for it.

PostGIS: Bounding Box Indexing

Friday, December 19th, 2008

You may have seen from my previous post, that I’ve been playing around with PostGIS. I had some more time to dig deeper into it, and I found that, although functions like distance_sphere/ST_distance_sphere aren’t indexable, bounding-box lookups are. Bounding box indexing can be used in conjunction with those functions, to improve the performance of queries, fairly easily.

This only works for GiST indexes, which can be confusing if you’re not familiar with indexes in PostgreSQL. Say the GEOMETRY column you want to index is geom on the table zip. Basically, you just do:

CREATE INDEX zip_geom_idx ON zip USING GIST (geom);

For searching for points that are within a certain radius of another, PostGIS provides a function called expand/ST_expand, which expands a single point to a box, which can then be used in an index lookup.


The queries essentially break down to:

SELECT ... FROM ... WHERE
geom && expand('some point', 1)
AND distance_sphere('some point', geom) <= (15 * 1609.344);

The first part of the WHERE clause, geom && expand(’some point’, 1), expands “some point” 1 degree in all directions to create a box, and finds points that are inside or on the border of the expanded box. This is the part that indexing is used on.

The second part of the WHERE clause, distance_sphere(’some point’, geom) <= (15 * 1609.344), refines the selection (the rows that were found in the index scan), down to points that are within 15 miles of 'some point'.

I usually just use subqueries and a join, rather than doing multiple fetches, so in my case it looks more like:

SELECT t.* FROM mytable t
JOIN zip z ON (z.zip = t.zip)
WHERE z.geom && expand(
    (SELECT geom FROM zip WHERE zip = '10001'), 1)
AND distance_sphere(geom,
    (SELECT geom FROM zip WHERE zip = '10001')) <= (15 * 1609.344);

PostGIS: Calculate Distance Between ZIP Codes

Friday, November 14th, 2008

Followup Post: PostGIS: Bounding Box Indexing

Recently, I’ve been playing around with PostGIS (that’s geographic information system). The function, distance_sphere (or ST_distance_sphere, depending on the version), is of particular interest to me. Here’s what the documentation has to say about it:

“ST_distance_sphere(point, point), Returns linear distance in meters between two lat/lon points. Uses a spherical earth and radius of 6370986 meters. Faster than distance_spheroid(), but less accurate. Only implemented for points.”

This function is especially useful for calculating the approx. distance between zip codes. Lets say you have a table like this:

CREATE TABLE zip (
    zip CHAR(5) NOT NULL PRIMARY KEY,
    latitude FLOAT8 NOT NULL,
    longitude FLOAT8 NOT NULL
);

You could simply add a new column like so:

ALTER TABLE zip ADD COLUMN geom GEOMETRY;

Then populate the new column:

UPDATE zip SET geom = makepoint(longitude, latitude);

The function distance_sphere, calculates the distance in meters. In order to convert the result to miles, it needs to by divided by 1609.344. Lets say, you want to get all zip codes within 15 miles another zip code, say 10001. You could simply do:

SELECT zip FROM zip WHERE
    distance_sphere(geom,
        (SELECT geom FROM zip WHERE zip = '10001'))
    <= (15 * 1609.344);

There doesn’t seem to be a way to use an index to speed the query up, but it is still significantly faster than other methods. On the application side, you could use something like Memcached to cache the results, if applicable.

There are more accurate ways to find the distance between lat/lon points, however, they also require more work, and are slower.

PostgreSQL: Handling Ratings, Without Aggregates

Friday, September 26th, 2008

This is something you shouldn’t have to learn the hard way: How to create a ratings system without using aggregate functions. Specifically, what I’m talking about by an aggregate, is the AVG() SQL function. The problem with using AVG(), is it reads every row in the table, which works well when you’re not dealing with very many rows, but as your data grows, the query speed progressively slows down. This is because aggregate functions essentially perform a sequential scan on the table. The solution for this, is to store the average for each individual item in a table, and use simple math calculations for updates.

Such a table may look something like this:

CREATE TABLE item_ratings (
    item_id INTEGER NOT NULL PRIMARY KEY,
    rating DOUBLE PRECISION,
    votes INTEGER NOT NULL DEFAULT 0
);

CREATE INDEX item_ratings_rating_key ON item_ratings (rating);
CREATE INDEX item_ratings_votes_key ON item_ratings (votes);

There’s a number of options for adding the initial entry, that updates will eventually be done on, such as doing it in code or via a trigger/rule. For the sake of simplicity, lets say there’s already a blank entry for item #1.

INSERT INTO item_ratings (item_id, rating, votes) VALUES (1, 0, 0);

Now, instead of adding a new entry for each vote cast, you simply update the previous entry. Lets say someone casts a vote, giving it a 4 star rating. That would be handled as such:

UPDATE item_ratings SET
    rating = ((rating * votes) + 4) / (votes + 1),
    votes = (votes + 1)
    WHERE item_id = 1;

SELECT * FROM item_ratings WHERE item_id = 1;
 item_id | rating | votes
---------+--------+-------
      1          4        1

That’s pretty much it. Now all ratings are stored per item, instead of generated by an aggregate. This can save major headaches when dealing with large datasets, since AVG() in PostgreSQL tends to be slow when working with such datasets.

PostgreSQL Quick Tip: Indexing Large Columns

Monday, July 14th, 2008

When working with large columns, you can run into some indexing issues. For instance, there’s a limit on how many characters a btree index entry can use, which poses problems for indexing text fields, since inserts and updates will fail if they use more characters than the index can use. Also, you can run into performance and disk consumption issues with large columns. The solution for this is to index columns using the hashtext() function. so, ie:

CREATE INDEX table1_column1_hash ON table1 hashtext(column1);

Then in your queries, do:

SELECT * FROM table1 WHERE hashtext(column1) = hashtext('foo');

There’s some issues with this solution, mainly being that you can only have exact matches for the column, so you can’t do range scans or LIKE queries, however, if you need searching features you should probably be using full text searching, which is available in PostgreSQL 8.2+.

PHP PDO and Bytea/Blob Columns

Friday, June 27th, 2008

I’ve searched high and low, but haven’t found any solid documentation on how to work with BLOB/BYTEA columns with PDO. Here’s what I’ve figured out.

By default, when using the query method, PDO will return a Resource ID for binary objects. The Resource ID must be used together with fread(). So, ie:


$c = $db->query('SELECT blob FROM table WHERE id = 21');
$res = $c->fetch(PDO::FETCH_ASSOC);
$buf = null;

while (!feof($res['blob'])) {
    $buf .= fread($res['blob'], 2048);
}

fclose($res['blob']);

You can also fetch binary objects as a string, by binding the column with PDO::PARAM_STR:

$buf = null;
$st = $db->query('SELECT blob FROM table WHERE id = 21');
$st->bindColumn('blob', $buf, PDO::PARAM_STR);
$st->fetch();

It’s hard to believe there’s no documentation on this. I really hope I just missed it; this would be a pretty big thing to have no documentation on.

Update: The documentation is here, hidden by obscurity. It sounds like it’s talking about handling external large objects, and not internal blob columns, inside the table.

Regular Expressions In SQL

Wednesday, June 11th, 2008

Regular expressions are pretty easy to use, however, different DBMS’s handle them differently.

In PostgreSQL, regular expression searches are done via the “~” character. For instance:

SELECT * FROM table1 WHERE column1 ~ '^(.?)unq([a-z]{4})([1-9]+)$';

In MySQL, this is done via REGEXP:

SELECT * FROM table1 WHERE column1 REGEXP '^S([c-z])n(.*)$';

In Oracle, this is done via the REGEXP_LIKE function:

SELECT * FROM table1 WHERE REGEXP_LIKE(column1, '[[:digit:]]{2}$');