Archive for January, 2009

PostgreSQL And SELECT DISTINCT On Large Tables

Thursday, January 22nd, 2009

I’ve been noticing lately that PostgreSQL’s SELECT DISTINCT operation seems to be un-optimized for large tables. Take a look:

# EXPLAIN ANALYZE SELECT DISTINCT mycol FROM mytable;
QUERY PLAN
------------------------
 Unique (time=3393.085..3615.420 rows=20799 loops=1)
 ->  Sort  (time=3393.082..3514.374 rows=204494 loops=1)
       Sort Key: mycol
       ->  Seq Scan (time=9.455..289.479 rows=204494)
 Total runtime: 3626.724 ms
(5 rows)

Now compare DISTINCT to GROUP BY:

# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol;
QUERY PLAN
------------------------
 HashAggregate (time=415.129..427.497 rows=20799)
   ->  Seq Scan (time=8.127..269.190 rows=204494)
 Total runtime: 433.836 ms
(3 rows)

That’s quite an improvement. 3626.724 ms compared to 433.836 ms. Even with ordering, GROUP BY is still faster:

# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol
ORDER BY mycol;
QUERY PLAN
------------------------
 Sort  (time=642.126..654.545 rows=20799 loops=1)
   Sort Key: mycol
   ->  HashAggregate  (time=408.922..423.040 rows=20799)
         ->  Seq Scan (time=7.519..263.577 rows=204494)
 Total runtime: 661.537 ms
(5 rows)

The main problem seems to be that PostgreSQL is choosing a bad query plan. Lets try forcing PostgreSQL to use an index scan on SELECT DISTINCT:

# SET enable_seqscan = off;
SET
# EXPLAIN ANALYZE SELECT DISTINCT mycol FROM mytable;
QUERY PLAN
------------------------
Unique (actual time=0.013..361.917 rows=20799 loops=1)
   ->  Index Scan (actual time=0.012..266.856 rows=204494)
 Total runtime: 368.296 ms
(3 rows)

# SET enable_seqscan = on;
SET

Wow! Big improvement. Lets see how that compares to GROUP BY with sequential scans turned off:

# SET enable_seqscan = off;
SET
# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol
ORDER BY mycol;
QUERY PLAN
------------------------
 Group (actual time=0.014..362.013 rows=20799)
   ->  Index Scan (actual time=0.011..267.001 rows=204494)
 Total runtime: 368.201 ms
(3 rows)

# SET enable_seqscan = on;
SET

So, at this point, the two queries are comparable. Tweaking the configuration sometimes causes PostgreSQL to choose a bad plan on other queries, so it seems like if PostgreSQL is choosing a bad plan for a query like this, you can simply turn sequential scans off for it.