Counting Bugs

The typical organizational approach to software bugs baffles me completely. When I first went to work at HP Labs I used the one system for categorizing and fixing bugs that seemed to work. For some reason, the Lab became afflicted with the desire to subscribe to popular industry fashions ("methodologies"), and abandoned it.

I took the old system to my new job, and then to another job after that, and during the last fifteen years I have revised it on my own while still keeping the essential framework intact. I have continued to use it.

Goals

The goals of any useful bug system are clear cut:

  1. Provide reliable, repeatable, factual information to the project manager and the business owners so that good decisions can be made.
  2. Be as objective as possible so that politics are confined to decisions about what to do with the information rather than how to categorize problems. In other words, anyone performing the analysis of the problem should reach the identical conclusion about the category regardless of what action management later takes to deal with it.
  3. Be able to reuse the same bug evaluation system across time, regardless of the development circumstances. After all, how can you tell if you are making progress if the stats don't line up well against previous projects?

Vocabulary

Most systems (including mine) incorporate some version of the terms priority and severity. Priority generally answers questions of urgency, whereas severity is often assumed to indicate how much work is required to fix the problem.

Most other systems provide a middle ground in the form of medium priority and medium severity. In most applications of the systems, a liberal dollop of subjectivity and political counsel propels a plurality (if not an outright majority) of the bugs into the pile that is medium on both axes: medium severity and medium priority.

The result of these failures of will is little more than admitting that the system is inadequate.

Dealing with Priority

There are only two levels of urgency. No matter the problem, no matter the product, it either can or cannot be shipped with the problem unresolved, the bug un-repaired. In my system, the priority axis is just a checkbox, a bit, a yes-or-no:

  • show-stopper: The product may not be shipped until this problem is repaired.
  • not a show-stopper: We may fix it or we may not, but this bug will not stop us from shipping the product.

Reductionism? Surely. But it is also the truth. There is no need for a priority assignment with more finesse, and there is no benefit in maintaining the fiction that one is in use, the way polling uses "somewhat agree," "agree," and "strongly agree" to draw the respondent into one camp or the other.
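
As a sketch only (the record layout and field name below are hypothetical, not taken from any particular tracker), the whole priority axis fits in one boolean field:

    # Sketch: priority is a single bit, not a graduated scale.
    from dataclasses import dataclass

    @dataclass
    class Bug:
        title: str
        show_stopper: bool  # True: may not ship until fixed; False: fix it or not, ship anyway

    def release_blocked(open_bugs):
        """The product may ship only when no open bug is a show-stopper."""
        return any(bug.show_stopper for bug in open_bugs)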

Dealing with Severity

If you ask a programmer to estimate the severity of a bug, you are likely to, instead, get the subjective answer to another question: "How hard do you think this bug is to fix?" While the answer is important to know, it is not the answer to the objective question that needs to be asked: "What damage to the product is done by this bug?"

It is equally important to be able to make as much use as possible of the scoring system. The expression priority 1 may make conceptual sense because it aligns with the idiom top priority, but giving low numerical values to the most severe bugs hinders subsequent statistical analysis.

Instead, my system assigns low numbers to things that are not terribly severe, and higher numbers to more severe bugs.

How many levels do we need? How high can the numbers go? The choice is somewhat arbitrary, or perhaps subjective, to use the earlier vocabulary. However, objectivity here consists mostly of taking a confine-and-define approach to the ever-present subjectivity.

Here are nine levels, adapted somewhat from the HP system. They have served me well.

  1. A cosmetic defect. For example: spelling errors, incorrect colors, improper alignment of screen elements. Note that these are hardly ever seen outside the UI.
  2. A corner case. The error is present only after a lengthy set of steps to create a specific environment that is unlikely to be encountered often in general use.
  3. Documentation errors -- meaning the error can be fixed by altering the documentation, not necessarily that the documentation was itself in error when the bug was found.
  4. The defect has an obvious workaround. This class of error is common, and we learn to live with it. Most software, and even most devices, provide more than one way to perform an operation.
  5. The defect has a workaround, but the workaround is not obvious. The big difference between this and Severity 4 is that Severity 5 runs the risk of costing the company money in support. People seek help from forums, call the 800 number, or otherwise inflict financial damage on the provider.
  6. Serious error that is not (yet) a full feature failure. This worrisome classification can be seen as a Severity 5 for which there is no workaround, although some functionality remains.
  7. Broken feature. The feature is inoperable, or works incorrectly. Assuming one is using the definition of alpha that implies functionality is complete, these bugs often show up as the dreaded bug-from-a-bug-fix.
  8. Catastrophic failure, but without loss of work. We all know this one: the program crashes unexpectedly, but when we restart there is some way to recover most of our work.
  9. Sudden catastrophic failure with collateral damage. Program crashes and leaves the file system corrupt -- just as an example.
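
If the levels are going to be summed and squared later, it helps to keep the numbers and their meanings in one place. Here is a sketch as a Python enumeration; the level names are my own paraphrases of the definitions above, not standard terms:

    from enum import IntEnum

    class Severity(IntEnum):
        """The nine levels; low numbers are the least severe."""
        COSMETIC            = 1  # spelling, colors, alignment
        CORNER_CASE         = 2  # needs an unlikely sequence of steps to reproduce
        DOCUMENTATION       = 3  # fixable by altering the documentation
        OBVIOUS_WORKAROUND  = 4  # another way to perform the operation is apparent
        OBSCURE_WORKAROUND  = 5  # a workaround exists, but users will need support
        DEGRADED_FEATURE    = 6  # serious, no workaround, some functionality remains
        BROKEN_FEATURE      = 7  # the feature is inoperable or works incorrectly
        CRASH_RECOVERABLE   = 8  # catastrophic failure, but work can be recovered
        CRASH_WITH_DAMAGE   = 9  # catastrophic failure with collateral damage

    # Because IntEnum members behave as integers, they can be summed,
    # squared, and raised to a power directly in the aggregates shown later.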

How to use this system

In practice, high severity bugs are res ipsa loquitur show-stoppers. It is hard to imagine how one would stay in business selling a product filled with Severity 9 bugs. Just ask a Corvair owner. But it is also possible to have Severity 1 bugs kill a product, and in this case we turn to the Edsel. For example, an inadvertent misspelled word might be a profanity, in which case it would most definitely be a show-stopper.

A full discussion is beyond the scope of a web article, but there are a couple of general principles that may be applied: for example, corner cases really cannot exist until testing has advanced to the point where the product has been examined well enough to decide whether an event is difficult to reproduce, and documentation errors are uncommon until the documentation has been written.

There are three key aspects of using the system that can be discussed:

Plan of attack

In a lot of systems, the reward structure for programmers promotes claims of success such as "I have closed six bugs this week," or martyrdom, as in "I have been working on this bug for a week," neither of which contributes materially to projecting how close the product is to the release that lets the company begin making money.

In my system, we always work on the highest severity bugs first, in part because they do the most damage, and in part because their causes (and the associated repairs) tend to be related to those of less severe bugs. In my own experience it has been usual to fix a Severity 9 bug and have several Severity 6s and 5s closed out at the same time.

As a project manager, you should know that the correlation between effort-to-fix and severity is almost nil. Many Severity 9 bugs are extremely easy to fix because their consequences are so severe that the cause is easy to locate. But they might just as well be difficult. On the other hand, the fix for many Severity 5 bugs may involve a long quest to provide functionality that is not absolutely necessary, and can be difficult to shoehorn into the product.
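
A minimal sketch of that plan of attack, assuming each bug carries an integer severity like the enumeration above (the record shapes and titles here are invented for illustration):

    # Sketch: always take the open bug with the highest severity next.
    # Estimated effort-to-fix is deliberately not a factor in the ordering.
    def work_queue(open_bugs):
        """Return open bugs ordered most severe first; ties keep their original order."""
        return sorted(open_bugs, key=lambda bug: bug[1], reverse=True)

    # Hypothetical snapshot of (title, severity) pairs:
    snapshot = [("typo in About box", 1),
                ("crash corrupts index file", 9),
                ("export fails silently, no workaround", 6)]
    for title, severity in work_queue(snapshot):
        print(severity, title)   # the Severity 9 crash comes out on top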

Aggregation

Simply adding up the severities of the bugs will provide a meaningful result. If the sum of the severities of a hundred bugs is 450, and after fixing a few and finding a few new ones you are left with ninety bugs with a total severity of 460, you can be sure that your original testing was woefully inadequate or that the plan was not executed.

HP used a sum-of-the-squares figure for determining a statistic that was called "bug weight." In my experience there and afterwards, the squares of the larger severities do give a better picture of the stability of the product, and I recommend the use of squares. If you are familiar with the sabermetric concept known as Bill James' Pythagorean theorem, and you are comfortable with programming only slightly more math into your spreadsheet, there is a good argument to be made that the best exponent is also 1.8 for bug weight. It is probably just a cosmic coincidence, rather than a profound insight into the workings of either software or baseball.
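
The arithmetic fits comfortably in a spreadsheet, but a short sketch makes the variants explicit. The function names and snapshot figures below are invented for illustration:

    # Sketch: three ways to collapse a list of severities into one "bug weight".
    def simple_weight(severities):
        return sum(severities)                         # plain sum of severities

    def squared_weight(severities):
        return sum(s ** 2 for s in severities)         # HP-style sum of squares

    def weight(severities, exponent=1.8):
        return sum(s ** exponent for s in severities)  # the 1.8-exponent variant

    # Hypothetical snapshots of the open-bug list, a week apart:
    week_1 = [9, 7, 6, 6, 5, 3, 2]   # seven bugs, simple sum 38
    week_2 = [9, 9, 7, 6, 5, 3]      # six bugs, simple sum 39: a warning sign
    print(simple_weight(week_1), squared_weight(week_1), round(weight(week_1)))
    print(simple_weight(week_2), squared_weight(week_2), round(weight(week_2)))
    # Fewer bugs but a weight that holds steady or rises means the original
    # testing was inadequate or the plan was not executed.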

Finally: Is this always useful?

No. My view about the appropriateness of this metric is not set in stone. For example, the first big software project on which I had an ongoing role was the structural damage simulator for commercial aircraft at Boeing. After what happened, it was declared that there could simply be no bugs in the new prediction system; a statistical evaluation of the probability of failure in the software was not appropriate for a system whose job was to predict the probability of failure in the aircraft.

The system described here is maximally useful when the product you are trying to deliver has these properties: it is a moderately large system, and it is similar to other projects produced by the group or the company over time.