Archive for the 'Tech Talk' Category

MySQL bug

Tuesday, 27th April, 2010

I can’t remember the last time I stumbled across a bug in MySQL, however this morning I did.

Querying the following table to locate entities with more than one title attribute (this is a diagnostic query, not used in the application).

CREATE TABLE `entity_attr` (
  `entity_id` int(11) NOT NULL default '0',
  `entity_attr_id` int(11) NOT NULL default '0',
  `seq_no` int(11) NOT NULL default '0',
  `lang_id` int(11) NOT NULL default '0',
  `intval` int(11) default NULL,
  `strval` text,
  `created` timestamp NOT NULL default CURRENT_TIMESTAMP \
   on update CURRENT_TIMESTAMP,
  `import_source_id` int(11) NOT NULL default '0',
  PRIMARY KEY  (`entity_id`,`entity_attr_id`,`seq_no`,`lang_id`),
  KEY `entity_attr_id` (`entity_attr_id`,`strval`(10)),
  KEY `entity_attr_id_2` (`entity_attr_id`,`intval`),
  KEY `k_entity_id` (`entity_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

The query:

SELECT entity_id, COUNT(*) AS num FROM entity_attr \
WHERE entity_attr_id = 10 GROUP BY entity_id HAVING num > 1 LIMIT 10;
+-----------+-----+
| entity_id | num |
+-----------+-----+
|    228896 |   2 |
|    246726 |   2 |
|    409089 |   2 |
|    409091 |   2 |
|    409098 |   2 |
|    409104 |   2 |
|    409105 |   2 |
|    409106 |   2 |
|    409107 |   2 |
|    409108 |   2 |
+-----------+-----+
10 rows in set (2.65 sec)

OK, so there are occurrences of entities with multiple title attributes, what’s the most amount of titles a single entity has?

SELECT entity_id, COUNT(*) AS num FROM entity_attr WHERE entity_attr_id = 10 \
GROUP BY entity_id HAVING num > 1 ORDER BY num DESC LIMIT 10;
+-----------+--------+
| entity_id | num    |
+-----------+--------+
|    409109 | 525193 |
+-----------+--------+
1 row in set (0.93 sec)

Only one result? Wow, 525k titles! Really?

SELECT COUNT(*) FROM entity_attr WHERE entity_id = 409109 \
AND entity_attr_id = 10;
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

No.

Lets take another look, remove the limit.

SELECT entity_id, COUNT(*) AS num FROM entity_attr WHERE entity_attr_id = 10 \
GROUP BY entity_id HAVING num > 1 ORDER BY num DESC;
+-----------+-----+
| entity_id | num |
+-----------+-----+
|    409104 |   2 |
...
+-----------+-----+
107 rows in set (0.00 sec)

And if we bring it back.

SELECT entity_id, COUNT(*) AS num FROM entity_attr WHERE entity_attr_id = 10 \
GROUP BY entity_id HAVING num > 1 ORDER BY num DESC LIMIT 107;
+-----------+--------+
| entity_id | num    |
+-----------+--------+
|    409109 | 525193 |
+-----------+--------+
1 row in set (0.83 sec)

Looks like LIMIT is causing the issue here. Looks like 5 is a magic LIMIT number too, changing the entity ID in the result:

Limit 4:

+-----------+--------+
| entity_id | num    |
+-----------+--------+
|    409109 | 525193 |
+-----------+--------+
1 row in set (0.83 sec)

Limit 5:

+-----------+--------+
| entity_id | num    |
+-----------+--------+
|     55446 | 525193 |
+-----------+--------+
1 row in set (0.00 sec)

Limit 6:

+-----------+--------+
| entity_id | num    |
+-----------+--------+
|    409109 | 525193 |
+-----------+--------+
1 row in set (0.82 sec)

So how does the query plan change between these statements? (apologies for the formatting)

EXPLAIN SELECT entity_id, COUNT(*) AS num FROM entity_attr WHERE entity_attr_id = 10 GROUP BY entity_id HAVING num > 1 LIMIT 10;
+----+-------------+-------------+-------+---------------------------------+---------+---------+------+--------+--------------------------+
| id | select_type | table       | type  | possible_keys                   | key     | key_len | ref  | rows   | Extra                    |
+----+-------------+-------------+-------+---------------------------------+---------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | entity_attr | index | entity_attr_id,entity_attr_id_2 | PRIMARY | 16      | NULL | 158569 | Using where; Using index |
+----+-------------+-------------+-------+---------------------------------+---------+---------+------+--------+--------------------------+

Using the magic limit of 5:

EXPLAIN SELECT entity_id, COUNT(*) AS num FROM entity_attr WHERE entity_attr_id = 10 GROUP BY entity_id HAVING num > 1 LIMIT 5;
+----+-------------+-------------+-------+---------------------------------+---------+---------+------+-------+--------------------------+
| id | select_type | table       | type  | possible_keys                   | key     | key_len | ref  | rows  | Extra                    |
+----+-------------+-------------+-------+---------------------------------+---------+---------+------+-------+--------------------------+
|  1 | SIMPLE      | entity_attr | index | entity_attr_id,entity_attr_id_2 | PRIMARY | 16      | NULL | 79221 | Using where; Using index |
+----+-------------+-------------+-------+---------------------------------+---------+---------+------+-------+--------------------------+

Removing the limit (where we get correct results):

EXPLAIN SELECT entity_id, COUNT(*) AS num FROM entity_attr WHERE entity_attr_id = 10 GROUP BY entity_id HAVING num > 1;
+----+-------------+-------------+------+---------------------------------+------------------+---------+-------+--------+----------------------------------------------+
| id | select_type | table       | type | possible_keys                   | key              | key_len | ref   | rows   | Extra                                        |
+----+-------------+-------------+------+---------------------------------+------------------+---------+-------+--------+----------------------------------------------+
|  1 | SIMPLE      | entity_attr | ref  | entity_attr_id,entity_attr_id_2 | entity_attr_id_2 | 4       | const | 453654 | Using where; Using temporary; Using filesort |
+----+-------------+-------------+------+---------------------------------+------------------+---------+-------+--------+----------------------------------------------+

Could there be an issue using the primary key? Lets rebuild the indexes on this table.

REPAIR TABLE entity_attr QUICK;
+------------------+--------+----------+----------+
| Table            | Op     | Msg_type | Msg_text |
+------------------+--------+----------+----------+
| muco.entity_attr | repair | status   | OK       |
+------------------+--------+----------+----------+

Re-run our query:

SELECT entity_id, COUNT(*) AS num FROM entity_attr WHERE entity_attr_id = 10 \
GROUP BY entity_id HAVING num > 1 ORDER BY num DESC LIMIT 5;
+-----------+--------+
| entity_id | num    |
+-----------+--------+
|    409101 | 525193 |
+-----------+--------+

No joy. Lets simplify slightly:

SELECT entity_id, COUNT(*) AS num FROM entity_attr WHERE entity_attr_id = 10 \
GROUP BY entity_id HAVING num > 1 LIMIT 5;
+-----------+-----+
| entity_id | num |
+-----------+-----+
|    228896 |   2 |
|    246726 |   2 |
|    409089 |   2 |
|    409091 |   2 |
|    409098 |   2 |
+-----------+-----+

Looks like we’re good if we strip the ORDER BY clause. Are we still relying soley on an index to find rows?

EXPLAIN SELECT entity_id, COUNT(*) AS num FROM entity_attr WHERE entity_attr_id = 10 GROUP BY entity_id HAVING num > 1 LIMIT 10;
+----+-------------+-------------+-------+---------------------------------+---------+---------+------+--------+--------------------------+
| id | select_type | table       | type  | possible_keys                   | key     | key_len | ref  | rows   | Extra                    |
+----+-------------+-------------+-------+---------------------------------+---------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | entity_attr | index | entity_attr_id,entity_attr_id_2 | PRIMARY | 16      | NULL | 158569 | Using where; Using index |
+----+-------------+-------------+-------+---------------------------------+---------+---------+------+--------+--------------------------+

Yup.

So a mix of GROUP, HAVING, ORDER BY with LIMIT doesn’t play well on this version of MySQL.

This is bug wasn’t present on our production servers (running 5.0.x). This issue is exhibited from:

mysql  Ver 14.14 Distrib 5.1.37, for debian-linux-gnu (i486) \
using EditLine wrapper

Server version:	 5.1.37-1ubuntu5.1-log (Ubuntu)

I’ve not tested this on later releases of the 5.1.x branch. I need a machine I can dirty with two MySQL installs and (on Ubuntu) need to compile from source for the latest (at the time of writing) 5.1.46 release.

UPDATE: I installed Ubuntu 10.04 tonight (April 30th) on a VM and tested this query, the problem persists in Lucid Lynx (version 5.1.41-3ubuntu12).

Using sed to parse out C style multiline header comments

Saturday, 17th April, 2010

This was just too painful not to blog about. I hope this saves someone else some pain of reading sed documentation to put together a pattern to match C style header comments.

In my case I’m putting together a build script, part of it is to minify Javascript before deployment. The source file is in the format:

/**
 * My comment
 *
 * Description ...
 * @author me
 */

var source = 'code';

When minifying the code using YUI Compressor, it strips the comments. To include them in the minified file I’ll parse them out the source file and pipe them to the minified file, before appending output from the YUI Compressor.

Cutting to the chase:

sed -n '/^\/\*/p; /^ \*/p' < file.js

A brief description, the first pattern looks for the opening comment:

/^\/\*/

The second pattern matches the rest of the lines.

/^ \*/

The -n switch suppresses sed printing output. The /p modifier duplicates the matched line so combined they mimic grep behaviour.

Simple eh? It is in hindsight… but I exhaled a few expletives to get here. I guess that’ll teach me for being stubborn, I could have put together a PHP script to preg_match the comments in two minutes.

IE8 and CSS opacity = fail

Thursday, 1st April, 2010

I’m putting together a new look for this blog, it’s a much needed 2010 refresh, the current design is from 2006.

An element of the new design uses a semi-transparent opacity. Not a problem in Firefox, Chrome or Safari. This shouldn’t be a problem even in IE7 but Microsoft saw fit to screw IE over once again and break CSS opacity styling with IE8. I’m possibly in a worse case scenario, using absolute positioned divs, initially set with display: none.

I’ve tried all the recommended approaches without success.

position: absolute;
background-color: black; /* for IE */
background: rgba(0,0,0,0.7);
-ms-filter: "progid:DXImageTransform.Microsoft.Alpha(Opacity=70)";
filter: alpha(opacity=70);

I’ve tried other combinations too including the method on W3 Schools, which works when I view it in IE8 (but doesn’t on my site).

opacity:0.4; /* effects other browsers */
filter: alpha(opacity=60);

I guess if I don’t resort to a transparent PNG/GIF background, my new design will have to look a little ugly in IE8.

For now I’ve ripped out all the IE specific CSS attributes, at present IE accounts for < 5% of traffic on my blog. I’m not sure it’s worth the effort, as this is due to change with IE9 (fingers crossed on CSS3 support).

Installing PHPUnit on Snow Leopard

Saturday, 27th March, 2010

Snow Leopard ships with PHP 5.3.0, which at the time of writing, is quite recent (at least it’s a milestone version with a bunch of new features). I’ve not tried to install anything previously via PEAR on my Snow Leopard install.

When I tried to install PHPUnit, I ran into a problem:

$ sudo pear channel-discover pear.phpunit.de
Adding Channel "pear.phpunit.de" succeeded
Discovery of channel "pear.phpunit.de" succeeded
$ sudo pear channel-discover pear.symfony-project.com
Adding Channel "pear.symfony-project.com" succeeded
Discovery of channel "pear.symfony-project.com" succeeded
$ sudo pear install phpunit/PHPUnit
Did not download optional dependencies: pear/Image_GraphViz, pear/Log, symfony/YAML, use --alldeps to download automatically
phpunit/PHPUnit requires PEAR Installer (version >= 1.8.1), installed version is 1.8.0
phpunit/PHPUnit can optionally use package "pear/Image_GraphViz" (version >= 1.2.1)
phpunit/PHPUnit can optionally use package "pear/Log"
phpunit/PHPUnit can optionally use package "symfony/YAML" (version >= 1.0.2)
phpunit/PHPUnit can optionally use PHP extension "xdebug" (version >= 2.0.5)
No valid packages found
install failed

Oh noes! Time to upgrade PEAR. First check the PEAR channels are up-to-date:

$ sudo pear update-channels
Updating channel "doc.php.net"
Channel "doc.php.net" is up to date
Updating channel "pear.php.net"
Update of Channel "pear.php.net" succeeded
Updating channel "pear.phpunit.de"
Channel "pear.phpunit.de" is up to date
Updating channel "pear.symfony-project.com"
Channel "pear.symfony-project.com" is up to date
Updating channel "pecl.php.net"
Update of Channel "pecl.php.net" succeeded
$ sudo pear upgrade PEAR
downloading PEAR-1.9.0.tgz ...
Starting to download PEAR-1.9.0.tgz (291,634 bytes)
.............................................................done: 291,634 bytes
downloading Archive_Tar-1.3.6.tgz ...
Starting to download Archive_Tar-1.3.6.tgz (17,600 bytes)
...done: 17,600 bytes
downloading Structures_Graph-1.0.3.tgz ...
Starting to download Structures_Graph-1.0.3.tgz (30,191 bytes)
...done: 30,191 bytes
upgrade ok: channel://pear.php.net/Archive_Tar-1.3.6
upgrade ok: channel://pear.php.net/Structures_Graph-1.0.3
upgrade ok: channel://pear.php.net/PEAR-1.9.0
PEAR: Optional feature webinstaller available (PEAR's web-based installer)
PEAR: Optional feature gtkinstaller available (PEAR's PHP-GTK-based installer)
PEAR: Optional feature gtk2installer available (PEAR's PHP-GTK2-based installer)
PEAR: To install optional features use "pear install pear/PEAR#featurename"

Now retry the install of PHPUnit.

$ sudo pear install phpunit/PHPUnit
Did not download optional dependencies: pear/Image_GraphViz, pear/Log, symfony/YAML, use --alldeps to download automatically
phpunit/PHPUnit can optionally use package "pear/Image_GraphViz" (version >= 1.2.1)
phpunit/PHPUnit can optionally use package "pear/Log"
phpunit/PHPUnit can optionally use package "symfony/YAML" (version >= 1.0.2)
phpunit/PHPUnit can optionally use PHP extension "xdebug" (version >= 2.0.5)
downloading PHPUnit-3.4.11.tgz ...
Starting to download PHPUnit-3.4.11.tgz (254,439 bytes)
.....................................................done: 254,439 bytes
install ok: channel://pear.phpunit.de/PHPUnit-3.4.11

We’re good to go.

PHP UK Conference 2010 thoughts

Tuesday, 2nd March, 2010

On Friday I went to the PHP UK 2010 conference in Islington, London with my colleague John Field. There were three tracks, so between us we were able to cover most talks of interest. The conference organisers have said they’ll post the slides of each talk, with sync’ed audio, online soon. I’ll link the slides for these talks as they become available.

From the schedule, the talks I attended were:

PHP UK Conference 2010 talkI forgot to mention it on my feedback form but the WiFi was very good, considering there was an army of nerds hammering on it, it was fast enough and I never lost my connection (disclaimer: I was largely using it from my iPhone).

For anyone thinking of going next year, be warned, everyone had an iPhone – I’m pretty sure they don’t let you into the conference if you don’t own one! ;)

The talks prior to lunch were, time-wise, somewhat sabotaged by the key note speech overrunning.

Key note – the lost art of simplicity

Talk rating: 8/10

Josh Holmes raised some good points in his key note talk.

If you can’t clearly explain your solution, it’s probably not worth implementing

Josh said this was particularly important to ‘Enterprise’ business, where systems that go live can run for longer than your professional career. It’s a certainty that other developers will need to maintain or modify your code.

Gone are the days of a lone programmer writing enterprise software, free to obfuscate the code to keep his maintenance contract. If you’ve not distilled your solution down to what’s needed, and nothing more, you’re guaranteeing someone else extra mental baggage.

Developers have a tendency to over-engineer solutions

Sometimes we try to future proof our designs by adding features we think might be needed but in reality are sometimes never used. Ultimately this means we just wasted our time and added needless complexity.

Temptation to throw in the kitchen sink is stronger in the early years of your career as a developer, when you’re eager to use new technologies. Experience teaches you to concentrate on the problem and to solve it with the most suitable tool (at work anyway).

We strive for this at work, Paul (the CTO) has said on countless occasions ‘do the simpliest thing that works‘.

Usability testing

This topic highlighted to me that we (developers at ASP) never meet the users of our software. Probably not a surprise in web development. We interact heavily with the editors of the product but never the end users, they sit on the other side of customer support.

Although we’re planning to address usability issues by hiring a UX Manager, we do not perform usability testing; watching customers use our software, employing heat maps to see where the mouse pointer moves or use event logging to see which features are popular and which are failing. We do listen to feedback from ALA conferences and similar.

RDBMS in the social networks age

Talk rating: 7/10

This talk heavily contrasted the key note, which talked about simplicity. I couldn’t help but feel a sense of irony in the first 20 minutes, as this talk involved some complex SQL used to efficiently traverse graphs and trees. It was fortunate I spent some time in November working with Dijkstra’s algorithm, which provided enough of a refresher to help comprehend the concepts in this talk.

Although the topic was interesting, having never seen Common Table Expressions (CTEs) or Window Functions before, the speed at which Lorenzo moved through the slides meant it was demanding trying to keep up. I doubt I was the only one but that didn’t stop me feeling like a developer sat in a room full of DBAs. I don’t think it was so much the fault of the speaker, as the morning talks didn’t get their allotted hour.

After the talk, I found that MySQL has no support for CTE’s or Window Functions. Other major DB vendors such as Oracle (with Oracle SQL’s ‘CONNECT BY’ syntax for recursive CTEs), MS SQL 2005 and Postgres (since version 8.4) do support them.

You might be thinking ’so what? I’ve never needed them’. If you’re building any social features into your products, you’d be surprised how many common features like connections distance to you on Linked in, or friends suggestions on Facebook could / do make use of CTEs.

Database Optimisation

Talk Rating: 5/10

This was one of the more disappointing talks I attended. Remo Gaigioni talked about the issues his company encountered when getting their search marketing intelligence tool to scale. The talk covered a lot of topics familiar (you hope) to most developers, using EXPLAIN to optimize queries, moving larger text fields out into separate tables, use of InnoDB to minimize locking, using replication and memcached.

There were features more specific to Remo’s application such as job queues that you might not use in your own app. Exact details were omitted but it sounded like his app was also write heavy and struggled with table locks.

The nuggets of information I got from this talk:

Replication is single threaded

Spec’ing a beast of a machine to run as a slave doesn’t make your binary log delay disappear. On a typical 8 core machine, replication is going to run off one of them. That’s not the say your slaves shouldn’t be more powerful than your master because they should – dealing with a share of the reads along with repeating all the writes from the master.

Use a work queue service for job queues (aka right tool for the job)

MySQL really isn’t the most efficient tool for job queues, sure you can share queue status information across a cluster of machines easily but work queue services like beanstalkd do a better job. Any work you can take away from the database is going to help what remains in it.

Recommendations from the audience included using MySQL-MMM in place of HA proxy for load balancing cross-replicating masters, and switching out InnoDB for the Percona storage engine.

At this point, the day felt like a mental yo-yo coming down from Lorenzo’s talk to this comparatively high level discussion.

PHPillow & CouchDB

Talk Rating: 6/10

So I felt like I’d come full circle with this talk, from promoting CTEs in a RDBMS to this talk about a NoSQL DB. I was looking forward to this as I’ve been reading up on CouchDB recently. I’d read Kore Nordmann’s tutorial on PHPillow before attending this talk as well. I recommend the free book on the CouchDB site as an insight about how it differs from the traditional RDBMS model.

Unfortunately a large part of the talk was spent explaining CouchDB, as a consequence we had to pick one of three topics to discuss in more detail: PHPillow, views and Map Reduce or scaling CouchDB.

The majority vote went to scaling CouchDB. We still had to skip a lot of slides, my estimate is around 30/70 slides due to time constraints. I guess I would have felt like I got more out of this talk if I’d not read up on the subject beforehand.

Two points I noted from this talk:

  • CouchDB has no concept of schemas, consequently changing CouchDB views is less painful than ALTER statements in RDBMS’.
  • Debugging CouchDB’s Javascript views is a bit problematic at current. Erlang, the functional language it’s written in, apparently throws ambiguous error messages – a trait of functional languages.

‘In search of …’ integrating site search systems

Talk Rating: 9/10

I thought this talk was the best one I attended, and I’ve a page of frantically scrawled notes as proof. Ian Barber did a great job of introducing the key concepts search engines employ, and how you can tailor your own site’s search to best Google’s results. He covered MySQL’s fulltext search, Sphinx, Swish-e, Lucene, Solr and Xapian (which we use at work). There are PECL extensions available for all of these.

I thought about using of one of these search engines on this site and possibly combining it with a page ranking algorithm as detailed in the free second chapter of Manning’s Algorithms of the Intelligent Web.

Beating Google at it’s own game on your site is possible because you can employ ‘zone weighting’ (see slide 44). Google takes a guess using HTML tags, font sizes and other ‘one size fits all’ methods.

Features I thought might benefit products at work; use the spelling correction feature in Xapian and investigate sharding our Xapian databases to reduce the index size to address memory consumption in our ever growing data sets.

PHP code audits

Talk Rating: 5/10

My code is far from perfect, I attended this talk after all. I couldn’t help but have a little déjà vu during this session. I was reminded of a security course I attended back in 2005, the speaker used un-patched Windows 2000 Server machines to demonstrate exploits. This was good and all but we were 5 years on from it’s release, with 5 years of security patches under Win2k’s belt. When you tried out these exploits in the Real World™, none of them worked.

The same thing struct me about the code examples and warnings being presented to us here. Don’t use eval(), avoid using register globals, be careful of include() injection via register globals. Didn’t we have this talk back in 2005 as well?

With the introduction of frameworks in PHP land, a lot of code shifted from the procedural spaghetti code seen in these slides to more structured object orientated code. Most PHP developers know that object orientation promotes encapsulation and information hiding, so you have to go out of your way to write single scope scripts that might fall prey to these issues, if (for some unholy reason) you’re still running with register global enabled (then you’re just asking for it).

Writing OO code also means I never have to worry about include() & require() statements thanks to class autoloading. I use a database abstraction layer and so don’t worry about SQL injection due to using bind parameters. I don’t remember the last time I had to use a global variable because I think about my interfaces. The same goes for eval(), do you remember the last time you used it? You might have had me on escaping output, up until the point we implemented a multi-lingual frontend.

So unfortunately this talk was disappointing too, after coming out of Ian Barber’s on a high note.

Conclusion

Would I go to another PHP Conference? Maybe, it really depends on the subject of the talks. Having never been to a PHP community event, part of my attendance was due to curiosity.

On at least three occasions on Friday talks were cut short. Maybe in future, PHP conferences could have 90 minute slots, or talks that span two slots with an interval?

The the venue was good, location was good, directions were good, the WiFi was good, the free beer was good (I had one before leaving, thanks Facebook!) there just needs to be a bit of fine tuning with talk timings.

You are currently browsing greg's weblog archives in the Tech Talk category.

Categories

xhtml 1.1 compliant   xhtml 1.1 compliant