Umbilicus Mundi Blog

Optimization, a bedtime story

Originally published 9 November 2011

 

At every interview I have been to, someone asks me about optimization, never specifying which property is to be optimized; speed of execution is simply assumed. The question is usually asked in a slightly covert way, conspiratorially, and in a tone of voice I associate with a thirteen-year-old who has just discovered kissing and wonders whether you, as an adult, may also know of its joys.

My answer is not the one that the interviewers seem to want to hear, but it is the truth: "I attempt no optimization until there is a demonstrated need." Perhaps a case study (or two) would help.

A very short case study

From a friend:

"I had a young web developer working for me. We were checking in some code and I noticed he had changed some of my Python scripts. I asked him why he had done it. He explained that my code was quite inefficient and he had optimized it. So, we ran my inefficient code and timed it; then ran his optimized code and timed it. Zero difference. He had not compared the running time of the two versions of code."

Sigh. My experience is that this kind of thing happens regularly.

A more serious case study

So, how about a detailed look at the process?

Before we begin, let's put the Axiom of Optimization front and center:

" We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

This advice has its origin, at least in print, with Donald Knuth in 1974. I was still five years away from programming at that time, so the advice predates me. But my first serious mentor in the world of programming for money was a guy named Dr. Larry Andrews. "Dr. Larry" put me on the straight and narrow when I was 28 years old, and I have been doing things his way since.

Two months ago I wrote bLaTheR, a Lorem Ipsum replacement. For me it was part learning experience to absorb jQuery, part load test for a server process, and part entertainment. There is nothing difficult about what is being done; the program searches a database of text, rearranges it, and returns some trashed-up text that statistically resembles the input.

The simplicity of bLaTheR made it the ideal program for doing optimization explorations.

With bLaTheR running, it was time to study its performance. It had been written in the QUick-And-Dirty (QUAD) style of most weekend projects, and was therefore probably the most unoptimized approach imaginable. But the unoptimized approach allowed me to clearly instrument the steps:

  1. get the text from the database, assemble any multiple shreds, and put the result in a buffer
  2. parse the buffer into a trie (single pass) suitable for the derivation
  3. navigate the trie to generate the text, and stuff the text into the return message.

As I mentioned, bLaTheR was slipped into the framework of an existing daemon that was known to work well in terms of basic daemon functionality, so these three steps, rather than the daemon's framework, were the places to look for improvements; a key benefit of code reuse.
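
For readers who want something concrete, here is a rough sketch of steps [2] and [3] in Python. It is my own illustration of a depth-k statistical model built on a dictionary, not the actual bLaTheR code, and the names (build_trie, generate, depth) are mine. Step [1] is simply the database read of the source text, so it is omitted.

    import random
    from collections import defaultdict

    def build_trie(text, depth=5):
        """Step [2]: one pass over the source text, recording which
        character can follow each prefix of length `depth`."""
        trie = defaultdict(list)
        for i in range(len(text) - depth):
            trie[text[i:i + depth]].append(text[i + depth])
        return trie

    def generate(trie, depth, length=2048):
        """Step [3]: walk the trie, repeatedly choosing a statistically
        plausible next character, until `length` characters are produced."""
        out = list(random.choice(list(trie)))
        while len(out) < length:
            followers = trie.get("".join(out[-depth:]))
            if not followers:                        # dead end: restart from a random prefix
                out.extend(random.choice(list(trie)))
                continue
            out.append(random.choice(followers))
        return "".join(out)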

Instrumentation in the form of a POSIX microsecond stopwatch gave some rough orders of magnitude. As a benchmark I chose a 2K-byte shred (typical use) at an analysis depth of 5, using a 20K-byte source.
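
The stopwatch need not be elaborate. A minimal sketch, assuming Python's wrapper around the POSIX monotonic clock; the real instrumentation lived inside the daemon and is not shown here:

    import time

    class Stopwatch:
        """Microsecond stopwatch built on the POSIX monotonic clock."""

        def start(self):
            self.t0 = time.clock_gettime(time.CLOCK_MONOTONIC)

        def stop(self):
            """Elapsed time, in microseconds, since start() was called."""
            return (time.clock_gettime(time.CLOCK_MONOTONIC) - self.t0) * 1_000_000

    # Usage: bracket each of the three steps.
    #   sw = Stopwatch(); sw.start(); run_step(); print(sw.stop(), "microseconds")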

Step                               Baseline Time
[1] Database I/O                     2ms
[2] Parse text and build trie       50ms
[3] Generate the bLaTheR text        9ms
Total:                              61ms

It is clear which step needs the work. One of the first mistakes in optimization is to go after something that is already quick just because you can figure it out or because you suspect it. In the above example, if we could do the text generation in a single CPU cycle, we would still be left with 52/61 of the work. It is frequently tempting to point one's early suspicions toward the I/O, but in this case neither the I nor the O appears to be at fault.

(One other fact of note: step [3] is linear with respect to the amount of bLaTheR that is requested. The other steps take a constant amount of time per request, regardless of how much text is requested.)

I started down the obvious path. There was no need to parse the same text each time to build the trie. The difficult part, of course, was finding a way to store the trie in the database -- a coding adventure sure to pre-bug or re-bug the working code.

But the next day I had a new version of bLaTheR ready to go. I added a trie table to the database, and put logic in the code to check there first. If the appropriate trie was not found, bLaTheR would go get the text, build the trie, store it, and then try the trie again. No text would ever be parsed more than once, so no matter how slow the parsing operation might be, it was effectively eliminated from the overall performance. Another lesson in optimization: operations that are only done once are usually not worth considering for performance tuning.
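
In outline, the new logic is the classic check-then-populate pattern. A sketch, with hypothetical helpers (load_trie, store_trie, fetch_source) standing in for the real database calls, and build_trie as sketched earlier:

    def get_trie(db, flavor, depth):
        """Check the trie table first; parse the source text only on a miss."""
        key = "%s:%d" % (flavor, depth)
        trie = load_trie(db, key)              # hypothetical helper: read a stored trie
        if trie is not None:
            return trie                        # hit: the expensive parse never happens

        text = fetch_source(db, flavor)        # step [1]: get the raw text
        trie = build_trie(text, depth)         # step [2]: the slow, one-time parse
        store_trie(db, key, trie)              # memoize it for every later request
        return trie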

I reworked the instrumentation so that I now had three new steps:

  1. get the trie from the database.
  2. assemble it in memory.
  3. navigate the trie to generate the text, and stuff the text into the return message.

I spun up the new version of bLaTheR expecting a staggering improvement. I was half correct: it was staggering.

Step                               Time     Delta (from baseline)
[1] Database I/O                    60ms    +58ms
[2] Build trie                      35ms    -15ms
[3] Generate the bLaTheR text        9ms    - - -
Total:                             104ms    +43ms

Clearly my method of storing the trie in SQLite was ill considered. OK, it wasn't really considered at all; in my haste to look like an expert for Scott and other friends, I had minimized the time spent thinking about the problem rather than the execution time of the code. Another day, another 86,400,000 milliseconds lost to ego.

After yet one more day, I had a much better method of storing the trie, and I got the load time down to 4ms. This system also resulted in fewer steps to assemble the trie in memory, which improved things a bit more. So now I was left with:

Step                               Time     Delta (from baseline)
[1] Database I/O                     4ms    +2ms
[2] Build trie                      30ms    -20ms
[3] Generate the bLaTheR text        9ms    - - -
Total:                              43ms    -18ms
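
I have not spelled out the improved storage scheme. One plausible shape for it, and this is my guess rather than the actual bLaTheR code, is to keep each trie as a single serialized blob keyed by flavor and depth, so a load is one SELECT and one deserialization instead of a read and a conversion per node:

    import pickle

    # db is an open sqlite3.Connection; trie_cache is a hypothetical table.

    def store_trie(db, key, trie):
        """One row per trie, not one row per node."""
        db.execute("INSERT OR REPLACE INTO trie_cache (key, blob) VALUES (?, ?)",
                   (key, pickle.dumps(trie)))
        db.commit()

    def load_trie(db, key):
        """Return the stored trie, or None if it has not been built yet."""
        row = db.execute("SELECT blob FROM trie_cache WHERE key = ?", (key,)).fetchone()
        return pickle.loads(row[0]) if row else None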

There are two things to consider at this point, and as is so often the case there is good news and bad news.

The good news

is that bLaTheR scales almost perfectly linearly and is well suited to multi-processing and multiple threads of execution. In part, this is facilitated by a clear separation of the reading and the writing: we peruse the trie repeatedly, but after it is built we never change it. The only writing takes place to the output buffer that is a part of the message to be delivered.
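
A small sketch of that structure, using the generate() function from the earlier sketch. In CPython the global interpreter lock means a process pool would be needed for real parallelism, but the point about the shared, never-written trie is the same:

    from concurrent.futures import ThreadPoolExecutor

    def serve_requests(trie, depth, sizes):
        """Generate several shreds at once from one shared, read-only trie.
        The only writes are to each request's own output buffer."""
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda n: generate(trie, depth, n), sizes))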

The bad news

is that bLaTheR is intended to be a web service. As a web service, 43ms for a response time is pathetic. If one of Scott's web pages made six requests for bLaTheR text we could be talking about a quarter of a second to generate his copy. With performance numbers like that, Lorem Ipsum doesn't look so bad.

Before I continued, I engaged in the one traditional optimization, namely the optimization of the code as I wrote it.

I hacked on the 30ms block and improved the loop performance of the code that was "moving" the data from the database records to the trie nodes. I realized that for any given trie, the nodes were all the same size, and I used a bit of unprotected assembly language to move the data the way the transporter on Star Trek moves people, or apparition moves wizards in Harry Potter. Much to my pleasant surprise, this got the time down to about 21ms.
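
The real trick was in assembly, but its shape translates to higher-level languages: when every record is the same size, you can lift them all out of the raw bytes in one sweep instead of converting field by field in a loop. A rough Python analogue, with a made-up three-field node layout:

    import struct

    NODE = struct.Struct("<IIH")   # hypothetical node: two uint32 links and a uint16 count

    def nodes_from_blob(blob):
        """Unpack every fixed-size node record in one pass, rather than
        slicing and converting each field individually."""
        usable = len(blob) - (len(blob) % NODE.size)
        return list(NODE.iter_unpack(blob[:usable]))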

Still ... 34ms total is not what we need.

Situations like these are why God, or Kibo, or perhaps Donald Knuth gave us caching. Caching is not the solution to everything. Any Java programmer who has suffered through the performance problems associated with searching the cache in the JVM knows of what I speak; fortunately, I do not. Without proper care the un-groomed cache becomes a kind of Fibber McGee's closet for the detritus of database mistakes, and its performance eventually grows worse than the uncached performance.

For this reason, there are several cache management strategies by which the cache undergoes population control, or as Fibber McGee said, "I gotta get that closet cleaned out one of these days."

Here are three well-known ones:

LRU: Least Recently Used -- throw out those boxes of stuff at the back that we haven't opened since we moved here. I don't care if it is the sterling silver.
LFU: Least Frequently Used -- throw out the turkey roaster even if we need it tomorrow; we only use it once a year.
Random: Something has got to go -- Molly McGee's solution, if you have heard the radio show.

Fearing that I was once again about to succumb to the apple of temptation known as premature optimization, I went for the random method. After all, it requires no bookkeeping and it is hard to imagine anything worse for performance than a bug in the bookkeeping for this or any cache.

Thinking carefully about the use model where the requestor is likely to make several consecutive requests for the same flavor of bLaTheR, I opted for a slightly non-random replacement by employing the simple heuristic that the cache should never expel its most recently used member. While this choice ensures nothing, it does make it unlikely that a sequence of requests will be interrupted by a cache operation that removes an item that will be needed right away.
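
A minimal sketch of that policy; the class and the names are mine, and the real cache lives inside the daemon:

    import random

    class TrieCache:
        """Bounded cache with random eviction, except that the most recently
        used entry is never the one expelled."""

        def __init__(self, capacity=16):
            self.capacity = capacity
            self.entries = {}
            self.most_recent = None

        def get(self, key):
            trie = self.entries.get(key)
            if trie is not None:
                self.most_recent = key
            return trie

        def put(self, key, trie):
            if key not in self.entries and len(self.entries) >= self.capacity:
                candidates = [k for k in self.entries if k != self.most_recent]
                del self.entries[random.choice(candidates or list(self.entries))]
            self.entries[key] = trie
            self.most_recent = key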

So where did we finish?

Step                               Time     Delta (from baseline)
[1] Cache I/O                       <1ms    -1ms, at least
[2] Build trie                       0ms    -50ms
[3] Generate the bLaTheR text        9ms    - - -
Total:                               9ms    -51ms

Last updated 2017-02-23T20:04:52+00:00.