Archive for the ‘PHP’ Category

Aggressive Caching With Memcached

Sunday, August 17th, 2008

There’s many different methods to use caching, with memcached. I call this one the Aggressive approach.

The basic logic is this: if you don’t even touch the database, you will be able to serve far more requests, than if you did. The biggest downside is, you can’t really do things too dynamically, and really, this only works well for flat pages (ie., it’s not for anything that displays content, based on the user).

In my case, I added the following code (shown as pseudo code) to the main/controller/dispatch script:


$memcache = new Memcache;
$memcache->connect(...);
$data = $memcache->get($_SERVER['REQUEST_URI'])

if ($data) {
    echo $data;
    die();
 }


$db = new PDO(...);
ob_start();
ob_implicit_flush(false);
...
... do whatever ...
...
$data = ob_get_contents();
$memcache->set($_SERVER['REQUEST_URI'], $data, 0, 3600);
ob_end_flush(); 

One of the biggest downsides to this approach, is figuring out how to delete cached objects when something’s updated. If you have a single table that holds the pages that you’re caching, however, you can just add a hook that deletes the cached object, by URI, when anything is written that would affect it.

This really doesn’t work well for anything super dynamic, but it’s great for increasing performance in areas that would otherwise waste resources.

Speeding Up PHP Applications

Friday, July 4th, 2008

Since everyone else seems to have a list of how to speed up PHP Apps, I thought it was about time I made one. Here’s some basic rules for speeding up PHP Applications:

  • Avoid including large files/classes. The more PHP has to process, the slower it is. A lot of frameworks have needlessly large classes. It’s recommended that you split things up into smaller “modules” and include them as needed. Also, only include what you need. PHP has to process everything, even the stuff you don’t use. With some sites, I’ve seen the number requests per second increase, 4 times, simply by moving away from needlessly large frameworks and unneeded includes.
  • Use PHP’s built-in functions whenever you can. PHP has a lot of functions for working with strings, arrays, etc. A lot of times, they can save you headaches, and improve performance over user created functions.
  • Avoid Regular Expressions. Instead, use functions like str_replace, in place of preg_replace, or strstr, in place of preg_match, when possible.
  • Be careful of the way you reference objects. PHP is known to have trouble with garbage collection under certain situations, such as when you’re dealing with classes that are circular-referenced. These types of situations usually go unnoticed, until PHP start exceeding it’s memory limit.
  • Avoid functions like file_get_contents(), and file_put_contents(), when working with large files… Unless you want to run out of memory.
  • Use memcached, aggressively if possible. Hmm, lets see, 100 requests/second vs 1000 requests/second… On the same server. It’s a no brainer.
  • Don’t ever use persistent connections. They may seem like a good idea, and they may appear to make your app faster in benchmarks, but it will severely cripple your ability to handle high traffic… Let me explain, persistent connections and connection pooling aren’t the same. Persistent connections are per user, and users will often end up with 2 or 3 persistent connections open for them. Nobody else can use these connections; this leads to your server running out of database connections with a relatively small number of users (think 40-60). If you really need this functionality, use an actual connection pooler.
  • Avoid making your database grumpy; most bottlenecks are related to the database. This typically means avoiding sequential scans, aggregate functions, etc. With MySQL, avoid complex JOINs, as well as date range scans, and “WHERE column IN (SELECT …)” queries. With PostgreSQL, avoid “SELECT COUNT(*)”, and keep the total number of queries to a minimum.
  • Get more servers. Single servers can only handle so many requests per second. When you get to a point where your server is struggling under the load, and there’s no clear bottlenecks, the only thing you can do is scale to your needs.

PHP PDO and Bytea/Blob Columns

Friday, June 27th, 2008

I’ve searched high and low, but haven’t found any solid documentation on how to work with BLOB/BYTEA columns with PDO. Here’s what I’ve figured out.

By default, when using the query method, PDO will return a Resource ID for binary objects. The Resource ID must be used together with fread(). So, ie:


$c = $db->query('SELECT blob FROM table WHERE id = 21');
$res = $c->fetch(PDO::FETCH_ASSOC);
$buf = null;

while (!feof($res['blob'])) {
    $buf .= fread($res['blob'], 2048);
}

fclose($res['blob']);

You can also fetch binary objects as a string, by binding the column with PDO::PARAM_STR:

$buf = null;
$st = $db->query('SELECT blob FROM table WHERE id = 21');
$st->bindColumn('blob', $buf, PDO::PARAM_STR);
$st->fetch();

It’s hard to believe there’s no documentation on this. I really hope I just missed it; this would be a pretty big thing to have no documentation on.

Update: The documentation is here, hidden by obscurity. It sounds like it’s talking about handling external large objects, and not internal blob columns, inside the table.

Improving Dynamic Images [Theoretically]

Thursday, March 20th, 2008

Something I’ve been thinking about lately, is how to improve the performance of images served from a database. The model I’ve deployed in the past, looks like this:

  • Apache Rewrites all requests to /Images to /image-script.php?image=$1
  • The script then includes all base files, some not necessary, and performs a sanity check.
  • The script grabs the meta-info from the database, for the requested file.
  • An ETag, as well as other HTTP headers, such as Cache-Control, Content-Disposition, etc, are generated.
  • The script checks for a cached copy of the data (using Cache_Lite), and if it’s not found or it’s expired, it fetches the image data from the database, and caches it.
  • Either the headers and data is sent, or a 304 Not Modified is sent.

The flaw with this design: things are included, but not used, and the database is touched for every request. The new design that I’m thinking about would be a very minimalistic class, and it would only access the database if the cache doesn’t exist, or if it’s expired. It would look more like this:

  • Class is initialize, and sanitizes all input.
  • When called upon, the class checks the cache for the serialized database result, and unserialize()’s it if it’s found.
  • If not found, it opens the database, fetches the data, then caches the serialize()’d result.
  • Either a 304 Not Modified is then sent, or the headers are generated and sent along with the data.
  • The backend administration panel clears the cache [per virtual file] when UPDATEs or DELETEs occur. This is to avoid having an updated image not hit live until the cache is expired (the previous model used the meta info from the database to find out if this happened or not).

My initial tests show that I can get around 200-500 requests a second on a real world server by using this model. Not too bad in my opinion.

PHP And pg_connect: Post Mortem

Sunday, February 24th, 2008

In my previous post, I tried to find out why a PHP script was so slow. Here’s an update from then.

Basically, pg_connect is slow. There’s probably a lot of PHP overhead to connect to a database plus it probably expects to be opening a fresh connection everytime, as using pooled connections doesn’t seem to speed it up very much (it goes from ~50 requests a second to ~200). Persistent connections seem to be much faster, albeit much slower than connecting to a pool with Python (450 requests a second compared to 1000), however, if you understand the inner workings of persistent connections you will know that using them with PostgreSQL is seriously broken, as a small number of users connecting to different sites on the same server can easily exhaust max_connections. Using pg_pconnect to connect to a connection pool doesn’t work either; it’s like combining the worst of both: the slowness of pg_connect, and the broken-ness of pg_pconnect.

So there’s pretty much no solution to this problem at this point in time, except maybe to port the script to Python.

humongous professional douchebag

PHP Is Slow… Really Slow

Friday, February 22nd, 2008

Lately, I’ve been trying to figure out what’s slowing down a site that’s currently in development… It didn’t take long to figure out what it was.

There’s a script on the server that handles image requests transparently, ie, utilizing mod_rewrite. Basically what it does is this:

  1. Connect to database.
  2. Query the database for the meta info of the requested file (created, modified, filetype, etc).
  3. Check if the client has a cached copy of the image, and if it does, send a 304: Not Modified, an ETag, and return.
  4. If not, grab the file from server-side cache if it exists, send it, and return.
  5. Else, fetch the image from the database, cache it locally on disk, and send the data.

99% of the time, all it does it grab the meta data, and send a 304 Not Modified, yet, it is extremely slow, so I decided to do a quick benchmark of PHP and Python, comparing how fast they deliver a small image (about 10kb) from a PostgreSQL database… The results were shocking.

The setup is very modest, an Athlon 64 3200+, 2GB ram, and a 500GB 7200.10 SATA II hard drive. The test is very simple: connect to the database, of which the connections are pooled by pgpool, fetch and send the image, then close the database connection.

The code used is as followed.

PHP:

// Note it's port 5433. That's the port that pgpool listens on.

$con = pg_connect("dbname=testdb user=testuser host=127.0.0.1"
    . " password=password port=5433");

$res = pg_query("SELECT data, filetype FROM images WHERE id = 1");
$res = pg_fetch_assoc($res);

header('Content-Type: ' . $res['filetype']);
echo pg_unescape_bytea($res['data']);

pg_close($con);

And now Python:

import psycopg2
from mod_python import apache

def handler(req):
    con = psycopg2.connect("dbname='testdb' user='testuser'"
        + " host='127.0.0.1' password='password' port='5433'")
    cur = con.cursor()

    cur.execute("SELECT data, filetype FROM images WHERE id = 1")
    row = cur.fetchone()
    con.close()

    req.content_type = row[1]
    req.write(row[0])
    return apache.OK

Pretty simple, right? I ran the test using ab, or Apache Benchmark, with the command:

ab -c 10 -t 15 http://localhost/<scriptname>

I ran the test three times for each script, here’s the results:

PHP - Maximum Requests/Second: 145.32
Python - Maximum Requests/Second: 1088.64
Apache2 Static Image - Max Requests/Second - 2778.54

WOW! Python is nearly 8 times faster than PHP in this test! I was so intrigued, I decided to dig deeper to see exactly what the bottleneck is in PHP.

I created a new test: Connect to the database, and close the connection immediately, don’t send any data whatsoever.

Here’s the results from that:

PHP - Max Requests/Second: 205.00
Python - Max Requests/Second: 1089.32

So PHP gains ~50 requests/second, and Python has no improvement at all. But what if we just leave the scripts blank? Basically just return and do nothing at all? Here’s the results from that:

PHP - Max Requests/Second: 1566.77
Python - Max Requests/Second: 1819.46

So basically, PHP sucks, at least when it comes to PostgreSQL connections. What’s strange is PHP doesn’t seem to have the same problem with MySQL connections, where by simply connecting to the database, performance drops significantly, so something is certainly wrong with the way PHP handles PostgreSQL. I bet there’s a lot of “MySQL vs. PostgreSQL” benchmarks out there that wield wildly inaccurate results because of this. Actually, if you compare MySQL and PostgreSQL purely in Python, you’ll find that PostgreSQL is quite a bit faster at handling binary data, such as images. I will be back if I find more information or a fix for this behavior.

UPDATE: it looks like the problem lies specifically with the pg_connect function. This problem doesn’t exists with pg_pconnect, however, you cannot connect to a database pool like pgpool/pgbounder with pg_pconnect, as that degrades performance.

PHP And pg_pconnect - Persistent Connections Suck

Thursday, January 10th, 2008

Lately, a server I’ve been maintaining has been running out of database connections. Upon further investigation, the culprit is clearly PHP’s pg_pconnect function. Persistent connections with PHP is buggy to say the least, and it’s often mistaken for connection pooling. Connection pooling is different from the persistent connection implementation that PHP uses, since persistent connections are only per Apache process. That means that each Apache process (likely per user), has it’s own pool of connections that only it can use. What’s worse is the connection isn’t closed until the Apache process times out, so most of the open connections are idle. The flaw in this design is astonishing. Right now, there’s 38 open database connections and only two of them are actively being used. Also, by default, PHP has pgsql.max_persistent set to -1, meaning each Apache process is allowed unlimited database connections. If you’re set on using persistent connections over pooling, you should set that to no more than 1 or 2.

The real solution is to use a “middleman” process to provide true connection pooling. This can easily be done easily with pgpool.

This is the pgpool configuration I’m using right now:

listen_addresses = ‘localhost’
port = 5433
socket_dir = ‘/tmp’
backend_host_name = ‘localhost’
backend_port = 5432
backend_socket_dir = ‘/var/run/postgresql’

Now, the only change you have to make to your PHP projects, is to use pg_connect instead of pg_pconnect, and connect to port 5433, instead of 5432. That’s it! Pgpool will automatically pool your database connections, and you will avoid running out of connections!

hmnos ocea
uogu rfsinlduhbg

Preserving Database Compatibiliy

Friday, November 30th, 2007

There’s always a lot of talk about preserving database compatibility, creating database abstraction layers, providing syntax compatibility, etc. At some point, you have to ask yourself, is it worth the effort? In most cases, it’s not. Although I highly recommend database abstraction layers, like PEAR::MDB2, unless you really need to have a project run on more than one database system, trying to keep compatibility is just silly. The reason is, MySQL, PostgreSQL, and Oracle all differ in several major ways. For instance, PostgreSQL and Oracle use Sequences, and MySQL uses it’s own auto_increment for primary keys, and the name of datatypes in Oracle differ greatly from the names of datatypes in MySQL and PostgreSQL. Also, each of those databases differ in behavior, such as MySQL having loose standards, and PostgreSQL and Oracle having strict standards, which can be seen with INSERTs on MySQL, where it will accept data that doesn’t meet the restrictions put on the column.

Keep in mind that in order to keep compatibility, you have to sacrifice speed. Now, for most small projects, sacrificing speed isn’t that big of an issue, but for large high traffic websites, sacrificing speed costs a lot, in server requirements, scalability, and user experience.

If you’re thinking of making a cross-database project, ask yourself these questions:

  • Are you expecting your project to be used on a wide variety of servers?
  • Are you expecting a large base of people who rely on your project?
  • Will making your project only compatible with a single database hurt it?
  • Will your project be publicly available, or sold to several different clients?

If you answer yes to one of these, than making your project cross-database compatible may be a good idea. Also, if you’re making something like a forum or blog package, than cross-database compatibility is probably a good idea.