PostGIS: Calculate Distance Between ZIP Codes

Followup Post: PostGIS: Bounding Box Indexing

Recently, I’ve been playing around with PostGIS (that’s geographic information system). The function, distance_sphere (or ST_distance_sphere, depending on the version), is of particular interest to me. Here’s what the documentation has to say about it:

“ST_distance_sphere(point, point), Returns linear distance in meters between two lat/lon points. Uses a spherical earth and radius of 6370986 meters. Faster than distance_spheroid(), but less accurate. Only implemented for points.”

This function is especially useful for calculating the approx. distance between zip codes. Lets say you have a table like this:

CREATE TABLE zip (
    zip CHAR(5) NOT NULL PRIMARY KEY,
    latitude FLOAT8 NOT NULL,
    longitude FLOAT8 NOT NULL
);

You could simply add a new column like so:

ALTER TABLE zip ADD COLUMN geom GEOMETRY;

Then populate the new column:

UPDATE zip SET geom = makepoint(longitude, latitude);

The function distance_sphere, calculates the distance in meters. In order to convert the result to miles, it needs to by divided by 1609.344. Lets say, you want to get all zip codes within 15 miles another zip code, say 10001. You could simply do:

SELECT zip FROM zip WHERE
    distance_sphere(geom,
        (SELECT geom FROM zip WHERE zip = '10001'))
    <= (15 * 1609.344);

There doesn’t seem to be a way to use an index to speed the query up, but it is still significantly faster than other methods. On the application side, you could use something like Memcached to cache the results, if applicable.

There are more accurate ways to find the distance between lat/lon points, however, they also require more work, and are slower.

6 Responses to “PostGIS: Calculate Distance Between ZIP Codes”

  1. Ian Eure Says:

    The problem with your query is that the expensive distance_sphere() calculation as to run for every zip in your table, even though only a small number will be returned by the query.

    You should probably nix that sub-SELECT and use a JOIN:

    SELECT d.`zip`
    FROM `zip` s JOIN `zip` d
    ON (distance_sphere(s.`geom`, d.`geom)

  2. Ian Eure Says:

    Oh, and on the subject of indexes: you’d want lat/lon indexed for my queries to work well. I don’t know how Postgres does it, but MySQL will only use a single index per query, which is extremely lame. So: you may need a compound index on (zip, lat, lon) for the query to work well.

    No amount of indexing will help your version of the query, because you’re doing a brute-force operation; comparing the distance from your source zip to every other. The only part an index is useful for is the PK lookup on the zip column.

  3. admin Says:

    @Ian

    The sub-select is way faster than the join, because PostgreSQL will grab the result from the sub-select once, and substitute the result in the query (processing all rows in the table once), where as in the join, PostgreSQL has to compare every row in the zip table against every row in the zip table (say you have 50,000 rows, then the join has to process 50,000^2 rows, or 2.5 billion rows). Both queries do sequential scans, the sub-select version simply cuts out the majority of the work.

    Currently, there are only two ways I know about, for PostgreSQL to use indexing in this situation. The first method is to create a materialized view, index that, and run all your queries on it. The second method is to create an index for every zip code, such as (where 10001_geom, is the geometry for the 10001 zip code):

    CREATE INDEX zip_zip_10001 ON zip (distance_sphere(geom, 10001_geom));

    The main issue being that the function result is what needs to be indexed (not the columns). Both methods use a lot of disk space, so doing a sequential scan and caching results temporarily makes more sense.

  4. Ian Eure Says:

    Your blog ate most of my first comment, which went on to optimize the query significantly. The query should have a WHERE s.zip = 10001, which makes it O(n+1). Obviously, there’s no point in the O(nn)-ness of an unconstrained JOIN, since we only care about the distance of one zip to the others.

    What I described was an optimization where you use the JOIN to narrow the resultset to an approximate area, then refine it with distance_sphere(). So you tighten the focus of the expensive operation to the results which are most likely to be relevant.

    In other words, if you’re looking for something in Delaware, don’t bother looking in Nevada.

    The magic here is that the average distance of one degree of lat/lon is 69.2 miles. So you can:

    SELECT d.* FROM `zip` s JOIN `zip` d
    WHERE s.`zip` = 10001
    AND (ABS(s.`lat` - d.`lat`)

  5. Atomized » Blog Archive » Optimizing distance calculations Says:

    […] brother wrote about calculating distances with PostGIS. I followed up with some suggestions, but WordPress seems to be eating my comments (strip_tags() is […]

  6. Ian Eure Says:

    Yeah, your blog nukes anything after a less than sign.

    See http://atomized.org/2008/11/optimizing-distance-calculations/ to see where I was going.

Leave a Reply