PostgreSQL Slow Count() Workaround
You may also be interested in this. It’s a far more elegant solution, if you only need the entire number of rows in a table.
MySQL zealots tend to point to PostgreSQL’s slow count() as a weakness, however, in the real world, count() isn’t used very often, and if it’s really needed, most good database systems provide a framework for you to build a workaround. The reason MySQL has fast counts, is because, they’re cached. This works in MySQL because MySQL uses serialized INSERTs, but higher grade DBs are transactional, and PostgreSQL uses MVCC, so caching row counts produces inaccurate results.
Here’s an example count() workaround for PostgreSQL, using a combination of PL/PgSQL and TRIGGERs. You could certainly extend this model, adding more columns to the cached rows table, so you can simulate SELECT count(id) WHERE foo = ‘bar’, although you’d have to add a TRIGGER for UPDATEs.
--- This table will be used as a dummy table,
--- that we wan't to get a count() from.
CREATE TABLE tbl1 (
id INTEGER
);
--- This table is where the cached counts will be stored
CREATE TABLE tbl1_num_rows (
num_rows INTEGER NOT NULL
);
--- Function to update the count in tbl1_num_rows,
--- on an INSERT
CREATE OR REPLACE FUNCTION tbl1_up_rows_insert ()
RETURNS TRIGGER AS $$
DECLARE
row RECORD;
BEGIN
--- Auto-add record
SELECT INTO row num_rows FROM tbl1_num_rows;
IF NOT FOUND THEN
INSERT INTO tbl1_num_rows VALUES (0);
END IF;
UPDATE tbl1_num_rows SET num_rows =
((SELECT num_rows FROM tbl1_num_rows) + 1);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
--- Function to update the count in tbl1_num_rows,
--- on a DELETE
CREATE OR REPLACE FUNCTION tbl1_up_rows_delete ()
RETURNS TRIGGER AS $$
DECLARE
row RECORD;
BEGIN
--- Auto-add record
SELECT INTO row num_rows FROM tbl1_num_rows;
IF NOT FOUND THEN
INSERT INTO tbl1_num_rows VALUES (0);
ELSE
IF row.num_rows > 0 THEN
UPDATE tbl1_num_rows SET num_rows =
((SELECT num_rows FROM tbl1_num_rows) - 1);
END IF;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
--- Add the TRIGGERS to tbl1
CREATE TRIGGER tbl1_up_rows_on_insert AFTER INSERT ON tbl1 FOR EACH ROW
EXECUTE PROCEDURE tbl1_up_rows_insert();
CREATE TRIGGER tbl1_up_rows_on_delete AFTER DELETE ON tbl1 FOR EACH ROW
EXECUTE PROCEDURE tbl1_up_rows_delete();

December 20th, 2007 at 5:37 am
Hi! I’m currently learning PL/PGSQL for the very same issue as well. It should be possible to reduce your code into one function by making use of the variable TG_OP which is on of INSERT, UPDATE or DELETE (depending on your create function definition):
IF TG_OP = ‘INSERT’ THEN
… do the insert stuff
ELSE
… do the delete stuff
END IF;
Furthermore in your delete function, you’re fetching the row record twice. You could use the first row record to determine whether num_rows is already at 0 (if found) and then skipping the update statement at all. Thereby the second select into row statement wouldn’t be needed:
SELECT INTO row num_rows …
IF NOT FOUND THEN
INSERT …
ELSE
IF row.num_rows > 0 THEN
UPDATE tbl1_num_rows SET num_rows = ((SELECT …
END IF;
END IF;
Yours,
Lars
December 20th, 2007 at 10:19 am
Thanks for the suggestions, I cleaned up the delete function! Like I said in the post, it’s merely an example. I wrote the entire thing in like 10 minutes
If I was writing this for an actual production database, I would certainly clean it up.
I actually wrote a cool little system, using PL/PGSQL and TRIGGERS just last week, to keep track of ratings using the same basic logic. It’s pretty damn cool, it basically works like this:
new_rating = (((current_rating * number_of_votes) + submitted_rating) / (number_of_votes + 1))
It keeps an average rating, along with the number of votes, cached in a table without even touching the actual ratings table.
January 29th, 2008 at 8:21 am
A simpler solution consists on using VACUUM and the catalog tables.
SELECT reltuples FROM pg_tables WHERE relname = ‘tbl1′
You have to VACUUM ANALYZE the table in question in order to update the catalog.
January 29th, 2008 at 8:23 am
Wrong query, it has to say:
SELECT reltuples FROM pg_class WHERE relname = ‘tbl1′ AND relkind = ‘r’
January 29th, 2008 at 9:46 am
Ricardo,
Yes, I am aware of that, and if you look around my blog, you’ll see that query at least once.
However, there’s several problems with using that method:
1) The row count isn’t updated until you run VACUUM ANALYZE, so it can’t be used for anything that requires the exact number of rows. Also, I have projects where people add/delete rows every couple minutes, so that produces wildly inaccurate results.
2) That query can’t be used to emulate a “WHERE foo = ‘bar’” query, ie, you can’t do something similar to SELECT COUNT(*) AS num_rows FROM tbl1 WHERE foo = ‘bar’, whereas with the other method, it can be extended to do that fairly easily.
I do use that query sometimes, it’s just not feasible for all situations.
March 22nd, 2008 at 12:02 am
I read on http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL (which is how I got here) that “it goes through the entire table sequentially”.
Can you give a database idiot (that would be me) an idea of what quantity of information it’s pulling from disk? Is it pulling in headers, or PKs from each record, or does COUNT(*) read the entire contents of the table from disk?
thanks!
March 22nd, 2008 at 8:20 am
Ken,
It depends on the situation. If you give it a qualifier (ie. WHERE col1 = ‘foo’), then it will read an index, however, it will still have to check each row to see if it’s visible to the current transaction, since that information isn’t stored in index, plus if the WHERE qualifies too many rows, say 600, then it will drop down to a sequential scan. If you do simply SELECT COUNT(*) FROM table, then it will perform a sequential scan on the entire table (see below).
There’s essentially two ways to read information from a table, you can do a sequential scan, or an index scan. What a sequential scan does, is scan every single row in the table, which as you may expect, is a lot of information and thus causes the query to be slow. The other type of scan is an index scan, which reads the information from an index, which is a lot smaller than the actual table, and is much faster. For more information on index scans and sequential scans, wikipedia is your friend.
Also, note that not all tables will need a trigger to emulate COUNT(*). In my experience, for Web Applications, as long as you have under 2000-8000 rows (depending on the server), you’re in the clear, and COUNT(*) really shouldn’t be an issue. Tip: use EXPLAIN ANALYZE to see if you really need a trigger. As long as you don’t go over ~10-20ms for the query, there shouldn’t be a problem for a Web Application. FYI, I have a 3000 row table that takes ~1.2ms to scan with COUNT(*), there’s never been a problem with speed in the place it’s used, and all of the queries combined take ~15ms. Typically, there’s only a problem once you start getting more than 10,000 rows.
June 19th, 2008 at 3:29 pm
I couldn’t understand some parts of this article PostgreSQL Slow Count() Workaround, but I guess I just need to check some more resources regarding this, because it sounds interesting.
May 12th, 2009 at 3:07 pm
MySQL also supports triggers which could be used to create a summary/aggregates table such as you have done. This isn’t really a PostrgreSQL vs MySQL issue. It’s a MVCC vs non-MVCC issue. InnoDB which uses MVCC similarly suffers from the same behavior as PostgreSQL where MyISAM and Infobright do not.
May 12th, 2009 at 3:21 pm
@Matthew
I have used MySQL for many projects, and I’m aware of the differences between engines, features like triggers, etc. However, when I wrote this, I was frustrated by people that kept pointing to the slowness of COUNT(*) on PostgreSQL as a major weakness (which is can be when not handled properly), but as a developer, you also have the tools to workaround it.
It really doesn’t matter what database you use, as long as you have developers that understand it and can make it scale.
August 5th, 2009 at 4:57 am
Hello All,
What is the difference between count(*) and count(1)? As the latter is the recommendation of Oracle to count rows in a table. Does it matter on PostgreSql?
August 5th, 2009 at 9:53 am
@Zoltán
Good question. From what I understand, there is no difference. I ran a couple benchmarks that seem to confirm this. I’m almost certain COUNT(*) and COUNT(1) do the exact same thing internally, and the only thing that’s different is the syntax.
September 23rd, 2009 at 9:31 am
Hi
dummy’s question.
is this strictly a problem of the count() function, or would for instance the function sum() have the same drawback ?
if first assumption is OK, why not add a small int column to the tables which always have 1 as a value, then sum it ?
Thx
September 23rd, 2009 at 11:31 am
@cdore
SUM() would have the same problem as COUNT(), if you’re reading all the rows in a table, as it would perform a sequential scan, which is what you want to avoid.
The count problem can be solved by having a separate table, and keeping row counts there. This can be done either inside the application, or by using triggers, like in my example.
The flexibility of this method is near endless, and not limited to COUNT(). I’ve used it in ratings systems, to track the average rating for items, basically doing the same thing but with AVG() instead of COUNT().
Essentially, what this is, is a type of materialized view. You’re taking a costly aggregate function, and taking the performance cost out if it, by simulating it’s logic and storing the result in a single row of a table.
January 15th, 2010 at 10:17 am
As was mentioned, creating an aggregate is frequently useful for many purposes. I’m lazy and my application is still in the early stages so I didn’t want to get wrapped up in maintaining triggers yet. Also, my application doesn’t require up-to-the-minute accurate estimates. I can update my counts every few days. I’m using a single aggregate query to create a table with counts in it.
However, I needed to be able to do a “SELECT count(*) WHERE field=’x’ ” instead of a simple “SELECT count (*)”. Here is what I created.
CREATE TABLE aggregate AS (SELECT result,count(result) AS count FROM original_table GROUP BY result ORDER BY count DESC) ;
futher, my data contains many one-offs which I don’t want to put into my aggregate table (I assume that if a value isn’t in my aggregate table that it’s too small to be useful)… here is what I do to leave those values out:
CREATE TABLE aggregate AS (SELECT * FROM (SELECT result,count(result) AS count FROM original_table GROUP BY result ORDER BY count DESC) AS countquery WHERE count > 3);
This method is fast enough. Because I have an index on the “result” field, It runs in under a minute against 3 million records. It takes approximately the same amount of time to generate counts for all the result values as it does to count the total number of records.
January 15th, 2010 at 12:34 pm
@Nick
Materialized views are very useful in your situation, which is pretty much what you’re using right now, though normally I actually create a couple views to materialize the table from:
CREATE VIEW salary_summary_view AS SELECT SUM(amount) AS total, employee_id GROUP BY employee_id;
CREATE TABLE salary_summary AS SELECT * FROM table_summary_view;
It makes things easier to maintain, and you can always fall back on the view if you need up-to-the-second results. How are you maintaining the table? Do you have a script that runs every few days?
Triggers may or may not be a good thing when dealing with large datasets, since there’s added overhead to all write operations (INSERT, DELETE, etc). It depends if your server can handle the added overhead. Once you get to a certain point, though, there really isn’t any off-the-shelf solutions.