Insider: A Debugging Story
Peter Vogel looks at debugging from a different specific: One incident whose resolution depended, in part, on understanding who'd already searched for the bug and the history of the bug itself.
I've been doing a lot of LINQ programming recently, so my next column is probably going to be called something like "Stupid LINQ Tricks". Because LINQ, like SQL, is a declarative language you don't get a lot of code to look at when your code isn't doing what you want. But fortunately, much of your debugging effort shouldn't be obsessively examining code. I've even got a story that demonstrates the importance of understanding context, gathering evidence, doing an IS/is NOT analysis, and (the most important part) getting a good description of the bug.
I had been working at my first company for about a year when I got assigned "the negative inventory problem": For the previous two years, every once in a while an item in the company's inventory would show a negative balance. There was a report that ran every night that checked for these negative balances and it had found one for the first time in a couple of months. Several people had tried to find the problem before -- now it was my turn.
My debugging process is simple: First, get a good description of the problem and generate some preliminary theories; second, stabilize the bug (make it happen any time I want); third, fix the bug.
I started building the description by finding all the places in our applications where the inventory count was updated. I found, as I remember, three routines that set the inventory count: one set from user input and two that set it from other data in the database. I didn't spend much time looking at the code because, if that's where the problem was, I figured one of the smart, clever people who'd looked at the bug before me would have already found the problem. It was, however, input to the description.
After looking at the code, I was ready to generate some preliminary theories. I had three:
Improving the Description
- I didn't understand how the tools worked and wasn't reading the code correctly. This seemed the least likely because, while I was fresh out of school and didn't know much, the previous investigators probably did.
- The inventory count was being updated somewhere else and everyone had forgotten about it. This seemed pretty likely: the system had been around for a while and source control wasn't all it could be.
- Some of the data used to update the inventory balance weren't valid, and when used to update the inventory count, generated a negative number. This also seemed possible, but the system inputs made it unlikely there was bad data in the system.
I needed to improve my description, so I started to do an "IS/is NOT" analysis: The bug occurs when this IS true/the bug does not occur when this is NOT true. For instance, I could have instrumented the three routines to have them write to a log whenever they executed. The next time that an inventory count went negative, I could check the log and see which routine(s) had executed on the inventory item that went negative. If none of them had executed, that would point to option two (the bug is NOT associated with any known routine) and if only one of the routines had executed, that would point to that routine (the bug IS associated with this routine).
Before I did that, I took advantage of a log we already had: the bug reports listing every incident of our negative inventory account. I graphed that data by date to see what associations I could find (for instance: the bug IS happening on Mondays; the bug is NOT happening on holidays). What I discovered was that the bug had appeared out of nowhere about two years ago and had been occurring virtually every day. Gradually, however, the incidences had tapered off until, now, we might have a month without one.
Option 3 (bad data) seemed more probable now. The best explanation I could come up with was that something had been done just before the bugs started appearing that had "seeded" the database with bad data. Initially, the amount of bad data had been high so we had an incident every day. But -- I was extending my theory here -- whatever generated the bad inventory count (or the clean-up process afterwards) also "fixed" the bad data. As a result, the amount of bad data was decreasing over time and it was taking longer to hit bad data. Since the bugs had started on a Monday, the culprit was probably some job job run over the weekend which bypassed the normal inputs.
Stabilize and Fix
As soon as I described the bug this way to the rest of the team, two of the senior programmers who knew enough of the history of the system flagged the most likely cause. In literally two or three minutes, the guys were able to identify the job that would have generated the bad data and the data that the job would have inserted.
The next step was to make the bug happen whenever we wanted: To stabilize it. The guys ran a program that found some of the remaining data inserted by the weekend job two years ago, entered a transaction that would flag some of that data for processing, and waited for the report to show a negative balance the next morning. And the next morning it did. The guys then wrote a new program that found all the remaining bad data and fixed it. We never had another incident and retired the program that checked for the problem a year later.
Interestingly enough, though, I got all the credit for finding the problem even though the two guys who recognized my description did all the real work. I didn't point out that out. I mean, I looked like a genius. Nothing wrong with that.
Peter Vogel is a system architect and principal in PH&V Information Services. PH&V provides full-stack consulting from UX design through object modeling to database design. Peter tweets about his VSM columns with the hashtag #vogelarticles. His blog posts on user experience design can be found at http://blog.learningtree.com/tag/ui/.