Practical .NET

Insider: A Debugging Story

Peter Vogel looks at debugging from a different specific: One incident whose resolution depended, in part, on understanding who'd already searched for the bug and the history of the bug itself.

I've been doing a lot of LINQ programming recently, so my next column is probably going to be called something like "Stupid LINQ Tricks".  Because LINQ, like SQL, is a declarative language you don't get a lot of code to look at when your code isn't doing what you want. But fortunately, much of your debugging effort shouldn't be obsessively examining code. I've even got a story that demonstrates the importance of understanding context, gathering evidence, doing an IS/is NOT analysis, and (the most important part) getting a good description of the bug.

I had been working at my first company for about a year when I got assigned "the negative inventory problem": For the previous two years, every once in a while an item in the company's inventory would show a negative balance. There was a report that ran every night that checked for these negative balances and it had found one for the first time in a couple of months.  Several people had tried to find the problem before -- now it was my turn.

My debugging process is simple: First, get a good description of the problem and generate some preliminary theories; second, stabilize the bug (make it happen any time I want); third, fix the bug.

I started building the description by finding all the places in our applications where the inventory count was updated. I found, as I remember, three routines that set the inventory count: one set from user input and two that set it from other data in the database. I didn't spend much time looking at the code because, if that's where the problem was, I figured one of the smart, clever people who'd looked at the bug before me would have already found the problem. It was, however, input to the description.

After looking at the code, I was ready to generate some preliminary theories. I had three:

  1. I didn't understand how the tools worked and wasn't reading the code correctly. This seemed the least likely because, while I was fresh out of school and didn't know much, the previous investigators probably did.
  2. The inventory count was being updated somewhere else and everyone had forgotten about it. This seemed pretty likely: the system had been around for a while and source control wasn't all it could be.
  3. Some of the data used to update the inventory balance weren't valid, and when used to update the inventory count, generated a negative number. This also seemed possible, but the system inputs made it unlikely there was bad data in the system.
Improving the Description

I needed to improve my description, so I started to do an "IS/is NOT" analysis: The bug occurs when this IS true/the bug does not occur when this is NOT true. For instance, I could have instrumented the three routines to have them write to a log whenever they executed. The next time that an inventory count went negative, I could check the log and see which routine(s) had executed on the inventory item that went negative. If none of them had executed, that would point to option two (the bug is NOT associated with any known routine) and if only one of the routines had executed, that would point to that routine (the bug IS associated with this routine).

Before I did that, I took advantage of a log we already had: the bug reports listing every incident of our negative inventory account. I graphed that data by date to see what associations I could find (for instance: the bug IS happening on Mondays; the bug is NOT happening on holidays). What I discovered was that the bug had appeared out of nowhere about two years ago and had been occurring virtually every day. Gradually, however, the incidences had tapered off until, now, we might have a month without one.

Option 3 (bad data) seemed more probable now. The best explanation I could come up with was that something had been done just before the bugs started appearing that had "seeded" the database with bad data. Initially, the amount of bad data had been high so we had an incident every day. But -- I was extending my theory here -- whatever generated the bad inventory count (or the clean-up process afterwards) also "fixed" the bad data. As a result, the amount of bad data was decreasing over time and it was taking longer to hit bad data. Since the bugs had started on a Monday, the culprit was probably some job job run over the weekend which bypassed the normal inputs.

Stabilize and Fix

As soon as I described the bug this way to the rest of the team, two of the senior programmers who knew enough of the history of the system flagged the most likely cause. In literally two or three minutes, the guys were able to identify the job that would have generated the bad data and the data that the job would have inserted.

The next step was to make the bug happen whenever we wanted: To stabilize it. The guys ran a program that found some of the remaining data inserted by the weekend job two years ago, entered a transaction that would flag some of that data for processing, and waited for the report to show a negative balance the next morning. And the next morning it did. The guys then wrote a new program that found all the remaining bad data and fixed it. We never had another incident and retired the program that checked for the problem a year later.

Interestingly enough, though, I got all the credit for finding the problem even though the two guys who recognized my description did all the real work. I didn't point out that out. I mean, I looked like a genius. Nothing wrong with that.

About the Author

Peter Vogel is a system architect and principal in PH&V Information Services. PH&V provides full-stack consulting from UX design through object modeling to database design. Peter tweets about his VSM columns with the hashtag #vogelarticles. His blog posts on user experience design can be found at http://blog.learningtree.com/tag/ui/.

comments powered by Disqus

Featured

  • AI for GitHub Collaboration? Maybe Not So Much

    No doubt GitHub Copilot has been a boon for developers, but AI might not be the best tool for collaboration, according to developers weighing in on a recent social media post from the GitHub team.

  • Visual Studio 2022 Getting VS Code 'Command Palette' Equivalent

    As any Visual Studio Code user knows, the editor's command palette is a powerful tool for getting things done quickly, without having to navigate through menus and dialogs. Now, we learn how an equivalent is coming for Microsoft's flagship Visual Studio IDE, invoked by the same familiar Ctrl+Shift+P keyboard shortcut.

  • .NET 9 Preview 3: 'I've Been Waiting 9 Years for This API!'

    Microsoft's third preview of .NET 9 sees a lot of minor tweaks and fixes with no earth-shaking new functionality, but little things can be important to individual developers.

  • Data Anomaly Detection Using a Neural Autoencoder with C#

    Dr. James McCaffrey of Microsoft Research tackles the process of examining a set of source data to find data items that are different in some way from the majority of the source items.

  • What's New for Python, Java in Visual Studio Code

    Microsoft announced March 2024 updates to its Python and Java extensions for Visual Studio Code, the open source-based, cross-platform code editor that has repeatedly been named the No. 1 tool in major development surveys.

Subscribe on YouTube