Comments on: PostgreSQL Slow Count() Workaround
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/
On Databases, Recovery, Tech
Mon, 06 Sep 2010 10:05:18 +0000

By: admin, Fri, 15 Jan 2010 19:34:24 +0000
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/#comment-11485

@Nick

What you’re using right now is pretty much a materialized view, which is very useful in your situation. Normally, though, I create a couple of views and materialize the table from them:

CREATE VIEW salary_summary_view AS SELECT SUM(amount) AS total, employee_id FROM salaries GROUP BY employee_id; -- "salaries" is a placeholder for your source table

CREATE TABLE salary_summary AS SELECT * FROM salary_summary_view;

It makes things easier to maintain, and you can always fall back on the view if you need up-to-the-second results. How are you maintaining the table? Do you have a script that runs every few days?
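Here is a rough sketch of that view-plus-table setup with a periodic refresh. It is only an illustration: SQLite (through Python's standard sqlite3 module) stands in for PostgreSQL so the example runs on its own, and the `salaries` table, its sample rows, and the `refresh_summary` and `totals` helpers are names I made up.

```python
import sqlite3

# SQLite stands in for PostgreSQL; "salaries" and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salaries (employee_id INTEGER, amount INTEGER);
    INSERT INTO salaries VALUES (1, 1000), (1, 1200), (2, 900);

    -- The view always reflects the live data.
    CREATE VIEW salary_summary_view AS
        SELECT employee_id, SUM(amount) AS total
        FROM salaries
        GROUP BY employee_id;

    -- The table is a snapshot of the view and is cheap to read.
    CREATE TABLE salary_summary AS SELECT * FROM salary_summary_view;
""")


def refresh_summary(conn):
    """Rebuild the snapshot from the view inside one transaction."""
    with conn:
        conn.execute("DELETE FROM salary_summary")
        conn.execute("INSERT INTO salary_summary SELECT * FROM salary_summary_view")


def totals(conn, relation):
    """Return {employee_id: total} read from the given table or view."""
    rows = conn.execute(f"SELECT employee_id, total FROM {relation}")
    return dict(rows.fetchall())


conn.execute("INSERT INTO salaries VALUES (2, 100)")
stale = totals(conn, "salary_summary")       # snapshot still holds the old total
live = totals(conn, "salary_summary_view")   # the view sees the new row
refresh_summary(conn)
fresh = totals(conn, "salary_summary")       # snapshot now matches the view
```

The refresh is the part you would run from a scheduled script; between runs the table can be stale, which is when falling back on the view helps.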

Triggers may or may not be a good thing when dealing with large datasets, since they add overhead to all write operations (INSERT, DELETE, etc). It depends on whether your server can handle the added overhead. Once you get to a certain point, though, there really aren’t any off-the-shelf solutions.

By: Nick Roberts, Fri, 15 Jan 2010 17:17:52 +0000
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/#comment-11482

As was mentioned, creating an aggregate is frequently useful for many purposes. I’m lazy and my application is still in the early stages so I didn’t want to get wrapped up in maintaining triggers yet. Also, my application doesn’t require up-to-the-minute accurate estimates. I can update my counts every few days. I’m using a single aggregate query to create a table with counts in it.

However, I needed to be able to do a "SELECT count(*) WHERE field='x'" instead of a simple "SELECT count(*)". Here is what I created.

CREATE TABLE aggregate AS (SELECT result,count(result) AS count FROM original_table GROUP BY result ORDER BY count DESC) ;

Further, my data contains many one-offs which I don’t want to put into my aggregate table (I assume that if a value isn’t in my aggregate table, its count is too small to be useful). Here is what I do to leave those values out:

CREATE TABLE aggregate AS (SELECT * FROM (SELECT result,count(result) AS count FROM original_table GROUP BY result ORDER BY count DESC) AS countquery WHERE count > 3);

This method is fast enough. Because I have an index on the "result" field, it runs in under a minute against 3 million records. It takes approximately the same amount of time to generate counts for all the result values as it does to count the total number of records.
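For anyone trying this, a small sketch of the same idea follows. It is an illustration under assumptions, not Nick's exact setup: SQLite (via Python's sqlite3) stands in for PostgreSQL, the sample data is invented, `cached_count` is a hypothetical helper, and HAVING replaces the wrapping subquery since it filters on the aggregate directly.

```python
import sqlite3

# SQLite stands in for PostgreSQL; table and column names follow the comment
# above, and the sample data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE original_table (result TEXT)")
sample = ["a"] * 5 + ["b"] * 4 + ["c"] * 2 + ["d"]
conn.executemany("INSERT INTO original_table VALUES (?)", [(r,) for r in sample])

# HAVING filters on the aggregate itself, so no wrapping subquery is needed.
conn.execute("""
    CREATE TABLE aggregate AS
    SELECT result, count(result) AS "count"
    FROM original_table
    GROUP BY result
    HAVING count(result) > 3
""")


def cached_count(conn, value):
    """Look up a precomputed count; values left out of the table report 0."""
    row = conn.execute(
        'SELECT "count" FROM aggregate WHERE result = ?', (value,)
    ).fetchone()
    return row[0] if row else 0
```

Reporting 0 for missing values is a choice made for this sketch; it matches the assumption that anything absent from the aggregate table is too rare to matter.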

By: admin, Wed, 23 Sep 2009 18:31:39 +0000
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/#comment-8058

@cdore

SUM() would have the same problem as COUNT(), if you’re reading all the rows in a table, as it would perform a sequential scan, which is what you want to avoid.

The count problem can be solved by having a separate table, and keeping row counts there. This can be done either inside the application, or by using triggers, like in my example.

The flexibility of this method is near endless, and not limited to COUNT(). I’ve used it in ratings systems, to track the average rating for items, basically doing the same thing but with AVG() instead of COUNT().

Essentially, this is a type of materialized view. You’re taking a costly aggregate function and taking the performance cost out of it by simulating its logic and storing the result in a single row of a table.
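A minimal sketch of the ratings variant, under stated assumptions: SQLite (via Python's sqlite3) stands in for PostgreSQL, where each trigger would instead call a PL/pgSQL function, and the `ratings` and `rating_stats` tables plus the `average_rating` helper are hypothetical names. Storing a running count and sum keeps the average exact without rescanning the rows.

```python
import sqlite3

# SQLite stands in for PostgreSQL; all table and trigger names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ratings (item_id INTEGER, score INTEGER);

    -- One row per item holds the running count and sum; AVG is total / n.
    CREATE TABLE rating_stats (
        item_id INTEGER PRIMARY KEY,
        n       INTEGER NOT NULL DEFAULT 0,
        total   INTEGER NOT NULL DEFAULT 0
    );

    CREATE TRIGGER ratings_ins AFTER INSERT ON ratings
    BEGIN
        INSERT OR IGNORE INTO rating_stats (item_id) VALUES (NEW.item_id);
        UPDATE rating_stats
           SET n = n + 1, total = total + NEW.score
         WHERE item_id = NEW.item_id;
    END;

    CREATE TRIGGER ratings_del AFTER DELETE ON ratings
    BEGIN
        UPDATE rating_stats
           SET n = n - 1, total = total - OLD.score
         WHERE item_id = OLD.item_id;
    END;
""")


def average_rating(conn, item_id):
    """Read the average from the single stats row instead of scanning ratings."""
    row = conn.execute(
        "SELECT n, total FROM rating_stats WHERE item_id = ?", (item_id,)
    ).fetchone()
    if row is None or row[0] == 0:
        return None
    return row[1] / row[0]


conn.executemany("INSERT INTO ratings VALUES (?, ?)", [(1, 4), (1, 5), (1, 3), (2, 2)])
conn.execute("DELETE FROM ratings WHERE item_id = 1 AND score = 3")
scanned = conn.execute("SELECT AVG(score) FROM ratings WHERE item_id = 1").fetchone()[0]
```

This sketch leaves out an UPDATE trigger; if scores can be edited in place, one more trigger would need to adjust `total` by the difference between the new and old score.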

By: cdore, Wed, 23 Sep 2009 16:31:16 +0000
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/#comment-8056

Hi

A dummy’s question.

Is this strictly a problem of the count() function, or would, for instance, sum() have the same drawback?

If the first assumption is correct, why not add a smallint column to the table that always holds 1, then sum it?

Thx

By: admin, Wed, 05 Aug 2009 16:53:08 +0000
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/#comment-7071

@Zoltán

Good question. From what I understand, there is no difference. I ran a couple benchmarks that seem to confirm this. I’m almost certain COUNT(*) and COUNT(1) do the exact same thing internally, and the only thing that’s different is the syntax.
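A quick way to see that the two forms return the same result is sketched below. It only checks results, not speed, and it uses SQLite through Python's sqlite3 as a stand-in for PostgreSQL; the table `t` and its rows are made up. COUNT(col) is included for contrast, since it skips NULLs under standard SQL semantics.

```python
import sqlite3

# SQLite stands in for PostgreSQL; table "t" and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), (None,), ("b",)])

# COUNT(*) and COUNT(1) count rows; COUNT(col) counts non-NULL values of col.
star, one, col = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(col) FROM t"
).fetchone()
```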

By: Zoltán Fekete, Wed, 05 Aug 2009 11:57:53 +0000
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/#comment-7064

Hello All,

What is the difference between count(*) and count(1)? The latter is the recommendation on Oracle for counting rows in a table. Does it matter on PostgreSQL?

By: admin, Tue, 12 May 2009 22:21:44 +0000
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/#comment-4544

@Matthew

I have used MySQL for many projects, and I’m aware of the differences between engines, features like triggers, etc. However, when I wrote this, I was frustrated by people who kept pointing to the slowness of COUNT(*) on PostgreSQL as a major weakness (which it can be when not handled properly), but as a developer, you also have the tools to work around it.

It really doesn’t matter what database you use, as long as you have developers that understand it and can make it scale.

By: Matthew Montgomery, Tue, 12 May 2009 22:07:36 +0000
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/#comment-4543

MySQL also supports triggers, which could be used to create a summary/aggregates table such as you have done. This isn’t really a PostgreSQL vs MySQL issue. It’s an MVCC vs non-MVCC issue. InnoDB, which uses MVCC, suffers from the same behavior as PostgreSQL, whereas MyISAM and Infobright do not.

By: Aaron, Thu, 19 Jun 2008 22:29:38 +0000
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/#comment-485

I couldn’t understand some parts of this article PostgreSQL Slow Count() Workaround, but I guess I just need to check some more resources regarding this, because it sounds interesting.

By: admin, Sat, 22 Mar 2008 15:20:25 +0000
http://blog.charcoalphile.com/2007/12/12/postgresql-count-workaround/#comment-254

Ken,

It depends on the situation. If you give it a qualifier (e.g. WHERE col1 = 'foo'), then it will read an index. However, it will still have to check each row to see if it’s visible to the current transaction, since that information isn’t stored in the index. Plus, if the WHERE clause qualifies too many rows, say 600, the planner will drop down to a sequential scan. If you simply do SELECT COUNT(*) FROM table, then it will perform a sequential scan on the entire table (see below).

There are essentially two ways to read information from a table: a sequential scan or an index scan. A sequential scan reads every single row in the table, which, as you may expect, is a lot of information and thus makes the query slow. An index scan reads the information from an index, which is a lot smaller than the actual table, so it is much faster. For more information on index scans and sequential scans, Wikipedia is your friend.

Also, note that not all tables need a trigger to emulate COUNT(*). In my experience, for web applications, as long as you have fewer than 2000-8000 rows (depending on the server), you’re in the clear and COUNT(*) really shouldn’t be an issue. Tip: use EXPLAIN ANALYZE to see if you really need a trigger. As long as the query stays under ~10-20ms, there shouldn’t be a problem for a web application. FYI, I have a 3000-row table that takes ~1.2ms to scan with COUNT(*). There’s never been a problem with speed in the place it’s used, and all of the queries combined take ~15ms. Typically, there’s only a problem once you start getting more than 10,000 rows.
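If you want to check a table against a time budget from application code, something like the following sketch could work. It is not a replacement for EXPLAIN ANALYZE, which reports server-side timing in PostgreSQL; this measures client-side wall-clock time, uses SQLite via Python's sqlite3 so it runs on its own, and the `items` table, `timed_count` helper, and 10ms default budget are assumptions for illustration.

```python
import sqlite3
import time


def timed_count(conn, table, budget_ms=10.0):
    """Return (row_count, elapsed_ms, within_budget) for a full COUNT(*)."""
    start = time.perf_counter()
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return n, elapsed_ms, elapsed_ms <= budget_ms


# SQLite stands in for PostgreSQL; the "items" table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)", [(f"item {i}",) for i in range(3000)]
)

n, elapsed_ms, within_budget = timed_count(conn, "items")
```

If `within_budget` stays true at your expected table size, the trigger is probably not worth the extra write overhead yet.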
