Merging Tables To Kill JOINs

June 25th, 2009

I was working on this project that has several large tables, and many JOINs. It had reached the point where the JOINs were becoming too costly, but fortunately, some of the tables could be merged. The problem, though, was figuring out how to get the data merged. I came up with a solution that uses views and rules… Let’s say there’s 2 tables, “main_table” and “tbl1″, and “tbl1″ needs to be merged with “main_table”:

-- Create new columns in main_table
ALTER TABLE main_table ADD COLUMN col1 VARCHAR(32);
ALTER TABLE main_table ADD COLUMN col2 VARCHAR(32);

-- Create view for tbl1
CREATE VIEW tbl1_view AS SELECT * FROM tbl1;

-- Create rule for view, tbl1_view
CREATE RULE tbl1_view_upd AS ON UPDATE TO tbl1_view DO INSTEAD (
    UPDATE main_table SET col1 = NEW.col1, col2 = NEW.col2
        WHERE id = NEW.refid;
);

The view comes in useful, because it doesn’t affect updates to tbl1 while you’re working. Now, if you simply do an update to the view, it will update the newly created columns in “main_table” to reflect the ones in “tbl1″. The view and old table can then be safely dropped once the data is merged.

UPDATE tbl1_view SET id = id;
DROP VIEW tbl_view;
DROP TABLE tbl1;

Another view with rules can then be created to represent the data that was previously in “tbl1″ and redirect writes to “main_table”, so things don’t break while backend code is being updated. My solution turned out to be a simple way to merge the tables. I would definitely consider a different approach if you’re working with millions of rows, though.

PHP: The Problem With is_callable() And __call()

April 22nd, 2009

So I ran into this problem with PHP: when you have a class that uses method overloading via __call(), is_callable() will always return true for any function given to it, no matter if it exists or not. I know this is by design, but it’s still heavily brain-damaged. Here’s an example of what I’m talking about:


class foo {

    public function __call($name, $args)
    {
        echo "$name called\n";
    }
}

$foo = new foo();
if (is_callable(array($foo, 'non_existent_method'))) {
    echo "foo->non_existant_method() is callable\n";
}

My whole problem with this, is it’s completely undocumented. If you use method_exists(), you may expect it to return true for method overloading, but that’s not the case.

So:
is_callable() - Recognizes __call()
method_exists() - Doesn’t recognize __call()

More On Comcast’s Move To Digital

April 9th, 2009

Comcast’s plan to transition to digital is well on it’s way in Seattle/Tacoma. Infrastructure is being upgraded in many areas to increase HDTV capacity, which mostly involves upgrading old systems to an 860Mhz system. In some areas, analog channels are already in the process of being shut off. They’re not shut off yet; currently they display a screen with a phone number to call to order a digital cable box, but they’re sure to be shut off in the near future.

The space freed by the analog channels will be used primarily for HDTV. This is something Comcast desperately needs in order to compete with DirecTV and Verizon FiOS. Brier Dudley of the Seattle Times, posted a blog entry that explains which areas are expected to have their analog shut off.

Asking Who For What?

April 2nd, 2009

I recently stumbled across this gem, from a website called Minda News… I think the Moro Islamic Liberation Front should consider a name change. Click the image for full view.

Moro Islamic Liberation Front (MILF)

Note that “cell phone” is also horribly misspelled.

PostgreSQL AGE() Function Not Indexable

February 9th, 2009

I was working a project that needed to be able to look up records, from a certain time frame. Interestingly enough, I discovered that the AGE() function cannot use indexes. My workaround, is to create a functional index that extracts the unix timestamp from the TIMESTAMP data type, and uses a BETWEEN clause.

So:

SELECT * FROM my_table
WHERE AGE(NOW(), my_timestamp) <= '1 day'::interval;

Becomes:

CREATE INDEX my_table_epoch_idx ON my_table
  (EXTRACT(epoch FROM my_timestamp));

SELECT * FROM my_table WHERE EXTRACT(epoch FROM my_timestamp)
  BETWEEN EXTRACT(epoch FROM NOW()) - 86400
  AND EXTRACT(epoch FROM NOW());

Both queries fetch items where my_timestamp is less than 24 hours old, but the latter has the ability to use an index.

PostgreSQL And SELECT DISTINCT On Large Tables

January 22nd, 2009

I’ve been noticing lately that PostgreSQL’s SELECT DISTINCT operation seems to be un-optimized for large tables. Take a look:

# EXPLAIN ANALYZE SELECT DISTINCT mycol FROM mytable;
QUERY PLAN
------------------------
 Unique (time=3393.085..3615.420 rows=20799 loops=1)
 ->  Sort  (time=3393.082..3514.374 rows=204494 loops=1)
       Sort Key: mycol
       ->  Seq Scan (time=9.455..289.479 rows=204494)
 Total runtime: 3626.724 ms
(5 rows)

Now compare DISTINCT to GROUP BY:

# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol;
QUERY PLAN
------------------------
 HashAggregate (time=415.129..427.497 rows=20799)
   ->  Seq Scan (time=8.127..269.190 rows=204494)
 Total runtime: 433.836 ms
(3 rows)

That’s quite an improvement. 3626.724 ms compared to 433.836 ms. Even with ordering, GROUP BY is still faster:

# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol
ORDER BY mycol;
QUERY PLAN
------------------------
 Sort  (time=642.126..654.545 rows=20799 loops=1)
   Sort Key: mycol
   ->  HashAggregate  (time=408.922..423.040 rows=20799)
         ->  Seq Scan (time=7.519..263.577 rows=204494)
 Total runtime: 661.537 ms
(5 rows)

The main problem seems to be that PostgreSQL is choosing a bad query plan. Lets try forcing PostgreSQL to use an index scan on SELECT DISTINCT:

# SET enable_seqscan = off;
SET
# EXPLAIN ANALYZE SELECT DISTINCT mycol FROM mytable;
QUERY PLAN
------------------------
Unique (actual time=0.013..361.917 rows=20799 loops=1)
   ->  Index Scan (actual time=0.012..266.856 rows=204494)
 Total runtime: 368.296 ms
(3 rows)

# SET enable_seqscan = on;
SET

Wow! Big improvement. Lets see how that compares to GROUP BY with sequential scans turned off:

# SET enable_seqscan = off;
SET
# EXPLAIN ANALYZE SELECT mycol FROM mytable GROUP BY mycol
ORDER BY mycol;
QUERY PLAN
------------------------
 Group (actual time=0.014..362.013 rows=20799)
   ->  Index Scan (actual time=0.011..267.001 rows=204494)
 Total runtime: 368.201 ms
(3 rows)

# SET enable_seqscan = on;
SET

So, at this point, the two queries are comparable. Tweaking the configuration sometimes causes PostgreSQL to choose a bad plan on other queries, so it seems like if PostgreSQL is choosing a bad plan for a query like this, you can simply turn sequential scans off for it.

PostGIS: Bounding Box Indexing

December 19th, 2008

You may have seen from my previous post, that I’ve been playing around with PostGIS. I had some more time to dig deeper into it, and I found that, although functions like distance_sphere/ST_distance_sphere aren’t indexable, bounding-box lookups are. Bounding box indexing can be used in conjunction with those functions, to improve the performance of queries, fairly easily.

This only works for GiST indexes, which can be confusing if you’re not familiar with indexes in PostgreSQL. Say the GEOMETRY column you want to index is geom on the table zip. Basically, you just do:

CREATE INDEX zip_geom_idx ON zip USING GIST (geom);

For searching for points that are within a certain radius of another, PostGIS provides a function called expand/ST_expand, which expands a single point to a box, which can then be used in an index lookup.


The queries essentially break down to:

SELECT ... FROM ... WHERE
geom && expand('some point', 1)
AND distance_sphere('some point', geom) <= (15 * 1609.344);

The first part of the WHERE clause, geom && expand(’some point’, 1), expands “some point” 1 degree in all directions to create a box, and finds points that are inside or on the border of the expanded box. This is the part that indexing is used on.

The second part of the WHERE clause, distance_sphere(’some point’, geom) <= (15 * 1609.344), refines the selection (the rows that were found in the index scan), down to points that are within 15 miles of 'some point'.

I usually just use subqueries and a join, rather than doing multiple fetches, so in my case it looks more like:

SELECT t.* FROM mytable t
JOIN zip z ON (z.zip = t.zip)
WHERE z.geom && expand(
    (SELECT geom FROM zip WHERE zip = '10001'), 1)
AND distance_sphere(geom,
    (SELECT geom FROM zip WHERE zip = '10001')) <= (15 * 1609.344);

Versus / Golf Split, And Comcast Phasing Out Analog

December 8th, 2008

The Versus / Golf split is complete. For Comcast, in Seattle / Tacoma, the new channels are:

  • 625 Golf HD
  • 626 Versus HD 

The former hybrid channel Versus/Golf, is pending removal.

Also I read an interesting article about Comcast shutting off analog in Seattle / Tacoma (link). Basically, it says channels 2-29 will remain broadcasting in analog for the time being, but everything over 29 will be phased out over the next year. They could start the transition as soon as sometime this month, around the time they roll out their new Wideband Internet service. They are also lowering the price of their Digital Cable service, to the price of the current Extended-Basic service (the analog package, which is being phased out).