Archive for the ‘PostgreSQL’ Category

PostGIS: Calculate Distance Between ZIP Codes

Friday, November 14th, 2008

Recently, I’ve been playing around with PostGIS (that’s geographic information system). The function, distance_sphere (or ST_distance_sphere, depending on the version), is of particular interest to me. Here’s what the documentation has to say about it:

“ST_distance_sphere(point, point), Returns linear distance in meters between two lat/lon points. Uses a spherical earth and radius of 6370986 meters. Faster than distance_spheroid(), but less accurate. Only implemented for points.”

This function is especially useful for calculating the approx. distance between zip codes. Lets say you have a table like this:

CREATE TABLE zip (
    zip CHAR(5) NOT NULL PRIMARY KEY,
    latitude FLOAT8 NOT NULL,
    longitude FLOAT8 NOT NULL
);

You could simply add a new column like so:

ALTER TABLE zip ADD COLUMN geom GEOMETRY;

Then populate the new column:

UPDATE zip SET geom = makepoint(longitude, latitude);

The function distance_sphere, calculates the distance in meters. In order to convert the result to miles, it needs to by multiplied by 1609.344. Lets say, you want to get all zip codes within 15 miles another zip code, say 10001. You could simply do:

SELECT zip FROM zip WHERE
    distance_sphere(geom,
        (SELECT geom FROM zip WHERE zip = '10001'))
    <= (15 * 1609.344);

There doesn’t seem to be a way to use an index to speed the query up, but it is still significantly faster than other methods. On the application side, you could use something like Memcached to cache the results, if applicable.

There are more accurate ways to find the distance between lat/lon points, however, they also require more work, and are slower.

PostgreSQL: Handling Ratings, Without Aggregates

Friday, September 26th, 2008

This is something you shouldn’t have to learn the hard way: How to create a ratings system without using aggregate functions. Specifically, what I’m talking about by an aggregate, is the AVG() SQL function. The problem with using AVG(), is it reads every row in the table, which works well when you’re not dealing with very many rows, but as your data grows, the query speed progressively slows down. This is because aggregate functions essentially perform a sequential scan on the table. The solution for this, is to store the average for each individual item in a table, and use simple math calculations for updates.

Such a table may look something like this:

CREATE TABLE item_ratings (
    item_id INTEGER NOT NULL PRIMARY KEY,
    rating DOUBLE PRECISION,
    votes INTEGER NOT NULL DEFAULT 0
);

CREATE INDEX item_ratings_rating_key ON item_ratings (rating);
CREATE INDEX item_ratings_votes_key ON item_ratings (votes);

There’s a number of options for adding the initial entry, that updates will eventually be done on, such as doing it in code or via a trigger/rule. For the sake of simplicity, lets say there’s already a blank entry for item #1.

INSERT INTO item_ratings (item_id, rating, votes) VALUES (1, 0, 0);

Now, instead of adding a new entry for each vote cast, you simply update the previous entry. Lets say someone casts a vote, giving it a 4 star rating. That would be handled as such:

UPDATE item_ratings SET
    rating = ((rating * votes) + 4) / (votes + 1),
    votes = (votes + 1)
    WHERE item_id = 1;

SELECT * FROM item_ratings WHERE item_id = 1;
 item_id | rating | votes
---------+--------+-------
      1          4        1

That’s pretty much it. Now all ratings are stored per item, instead of generated by an aggregate. This can save major headaches when dealing with large datasets, since AVG() in PostgreSQL tends to be slow when working with such datasets.

PostgreSQL Quick Tip: Indexing Large Columns

Monday, July 14th, 2008

When working with large columns, you can run into some indexing issues. For instance, there’s a limit on how many characters a btree index entry can use, which poses problems for indexing text fields, since inserts and updates will fail if they use more characters than the index can use. Also, you can run into performance and disk consumption issues with large columns. The solution for this is to index columns using the hashtext() function. so, ie:

CREATE INDEX table1_column1_hash ON table1 hashtext(column1);

Then in your queries, do:

SELECT * FROM table1 WHERE hashtext(column1) = hashtext('foo');

There’s some issues with this solution, mainly being that you can only have exact matches for the column, so you can’t do range scans or LIKE queries, however, if you need searching features you should probably be using full text searching, which is available in PostgreSQL 8.2+.

PHP PDO and Bytea/Blob Columns

Friday, June 27th, 2008

I’ve searched high and low, but haven’t found any solid documentation on how to work with BLOB/BYTEA columns with PDO. Here’s what I’ve figured out.

By default, when using the query method, PDO will return a Resource ID for binary objects. The Resource ID must be used together with fread(). So, ie:


$c = $db->query('SELECT blob FROM table WHERE id = 21');
$res = $c->fetch(PDO::FETCH_ASSOC);
$buf = null;

while (!feof($res['blob'])) {
    $buf .= fread($res['blob'], 2048);
}

fclose($res['blob']);

You can also fetch binary objects as a string, by binding the column with PDO::PARAM_STR:

$buf = null;
$st = $db->query('SELECT blob FROM table WHERE id = 21');
$st->bindColumn('blob', $buf, PDO::PARAM_STR);
$st->fetch();

It’s hard to believe there’s no documentation on this. I really hope I just missed it; this would be a pretty big thing to have no documentation on.

Update: The documentation is here, hidden by obscurity. It sounds like it’s talking about handling external large objects, and not internal blob columns, inside the table.

PostgreSQL Planing Built In Replication

Wednesday, June 18th, 2008

Yes, that’s right, the PostgreSQL team is planning to add built in replication. This has come, in part, do to the demand of replication solutions, and the lack of easy-to-setup, easy-to-maintain replication for PostgreSQL. Many projects have moved to MySQL, simply because developers don’t want to deal with the headaches of Slony-I or PGCluster.

The features are planned to ship with PostgreSQL 8.4, however, do to technical issues, read-only queries to slaves may not be ready until 8.5.

This is going to be a huge leap for PostgreSQL.

The full announcement is on the PostgreSQL mailing list.

Regular Expressions In SQL

Wednesday, June 11th, 2008

Regular expressions are pretty easy to use, however, different DBMS’s handle them differently.

In PostgreSQL, regular expression searches are done via the “~” character. For instance:

SELECT * FROM table1 WHERE column1 ~ '^(.?)unq([a-z]{4})([1-9]+)$';

In MySQL, this is done via REGEXP:

SELECT * FROM table1 WHERE column1 REGEXP '^S([c-z])n(.*)$';

In Oracle, this is done via the REGEXP_LIKE function:

SELECT * FROM table1 WHERE REGEXP_LIKE(column1, '[[:digit:]]{2}$');

PostgreSQL: Ultra Simple COUNT(*) Workaround

Monday, April 21st, 2008

Just a little something I whipped up. It’s only for getting the total number of rows in table X. If you have under 10,000 rows, the performance hit of COUNT(*) is neglectable and you probably don’t need this.

Example:

SELECT __count_add('my_table');
SELECT rows FROM __count_tables WHERE tablename = 'my_table';

Ultra Simple COUNT(*) Workaround

Indexing Functions With PostgreSQL

Tuesday, April 15th, 2008

In PostgreSQL, it’s possible to create expressional indexes. In other words, indexes such as:

CREATE INDEX varchar_column_idx ON bar (UPPER(varchar_column));

It is possible to do this on any function that is marked as IMMUTABLE. IMMUTABLE just means that the function will always return the same result when given the same input. This is very powerful, as you can create indexes on user created functions, which can dramatically improve performance.

For instance, lets say I need to group some data by week, with every week starting on Sunday. I could do:


CREATE OR REPLACE FUNCTION find_previous_sunday
(TIMESTAMP) RETURNS DATE AS $$
    BEGIN
        RETURN
        ($1 + (-extract(dow from $1) || ' days')::interval)::date;
    END;
$$ LANGUAGE plpgsql IMMUTABLE;

Note, that you can always do extract(week from timestamp), but there isn’t much flexibility there, and clients are often picky about when they want their weeks to start, plus this is only an example. Onwards to the benchmark:


---
--- Test without indexing
---
EXPLAIN ANALYZE SELECT
    find_previous_sunday(entry_time) AS week_start,
    SUM(distance)
    FROM tbl
    WHERE user_id = 1
    GROUP BY week_start
    ORDER BY week_start DESC limit 1;

    QUERY PLAN
-------------------------------
 Limit  (... loops=1)
   ->  GroupAggregate  (...)(actual time=1.561..1.561...)
         ->  Sort  (...) (actual time=1.556..1.556 rows=2...)
               Sort Key: find_previous_sunday(entry_time)
               ->  Index Scan using tbl_user_id on tbl
                        (...) (actual time=0.060..1.388 rows=128...)
                     Index Cond: (user_id = 1)
 Total runtime: 1.651 ms
(7 rows)

---
--- Index find_previous_sunday function, and run test again.
---
CREATE INDEX tbl_week_start ON tbl
    (find_previous_sunday(entry_time));

VACUUM FULL ANALYZE;

EXPLAIN ANALYZE SELECT
    find_previous_sunday(entry_time) AS week_start,
    SUM(distance)
    FROM tbl
    WHERE user_id = 1
    GROUP BY week_start
    ORDER BY week_start DESC limit 1;

       QUERY PLAN
-------------------------------
 Limit  (...) (actual time=0.062..0.062 rows=1 loops=1)
   ->  GroupAggregate  (...) (actual time=0.062..0.062...)
         ->  Index Scan Backward using tbl_week_start on tbl
                  (...) (actual time=0.042..0.057 rows=2...)
               Filter: (user_id = 1)
 Total runtime: 0.096 ms
(5 rows)

1.651ms without indexing, 0.096ms with. Note that there’s only about 200 rows in the test table, and performance without indexing scales very poorly as the number of rows increases.