PostgreSQL And SELECT DISTINCT On Large Tables

I’ve been noticing lately that PostgreSQL’s SELECT DISTINCT operation seems to be un-optimized for large tables. Take a look:

# EXPLAIN ANALYZE SELECT DISTINCT mycol FROM mytable;
QUERY PLAN
------------------------
 Unique (time=3393.085..3615.420 rows=20799 loops=1)
 ->  Sort  (time=3393.082..3514.374 rows=204494 loops=1)
       Sort Key: mycol
       ->  Seq Scan (time=9.455..289.479 rows=204494)
 Total runtime: 3626.724 ms
(5 rows)

Now compare DISTINCT to GROUP BY:

# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol;
QUERY PLAN
------------------------
 HashAggregate (time=415.129..427.497 rows=20799)
   ->  Seq Scan (time=8.127..269.190 rows=204494)
 Total runtime: 433.836 ms
(3 rows)

That’s quite an improvement. 3626.724 ms compared to 433.836 ms. Even with ordering, GROUP BY is still faster:

# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol
ORDER BY mycol;
QUERY PLAN
------------------------
 Sort  (time=642.126..654.545 rows=20799 loops=1)
   Sort Key: mycol
   ->  HashAggregate  (time=408.922..423.040 rows=20799)
         ->  Seq Scan (time=7.519..263.577 rows=204494)
 Total runtime: 661.537 ms
(5 rows)

The main problem seems to be that PostgreSQL is choosing a bad query plan. Lets try forcing PostgreSQL to use an index scan on SELECT DISTINCT:

# SET enable_seqscan = off;
SET
# EXPLAIN ANALYZE SELECT DISTINCT mycol FROM mytable;
QUERY PLAN
------------------------
Unique (actual time=0.013..361.917 rows=20799 loops=1)
   ->  Index Scan (actual time=0.012..266.856 rows=204494)
 Total runtime: 368.296 ms
(3 rows)

# SET enable_seqscan = on;
SET

Wow! Big improvement. Lets see how that compares to GROUP BY with sequential scans turned off:

# SET enable_seqscan = off;
SET
# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol
ORDER BY mycol;
QUERY PLAN
------------------------
 Group (actual time=0.014..362.013 rows=20799)
   ->  Index Scan (actual time=0.011..267.001 rows=204494)
 Total runtime: 368.201 ms
(3 rows)

# SET enable_seqscan = on;
SET

So, at this point, the two queries are comparable. Tweaking the configuration sometimes causes PostgreSQL to choose a bad plan on other queries, so it seems like if PostgreSQL is choosing a bad plan for a query like this, you can simply turn sequential scans off for it.

2 Responses to “PostgreSQL And SELECT DISTINCT On Large Tables”

  1. Ivan Pavlov Says:

    Hmph. Have you run VACUUM ANALYZE on this table before comparing query plans?

  2. admin Says:

    Yeah, the problem was PostgreSQL’s cost estimate was way off, so it never attempted an index scan. After forcing it to use an index, and running VACUUM ANALYZE, it started choosing an index scan over a seqscan.

Leave a Reply