DevDisasters
The $100,000 Deadline
A critical auditing app was as down as the floor of the Grand Canyon. And each day it stayed down meant a six-figure fine.
Roger A. cursed to nobody in particular as he swiped and then re-swiped his badge over the sensors at the main entrance of the AuditTech offices. After weeks of normal activity, something had gone terribly wrong, prompting an "all hands on deck" crisis meeting on the one day he had scheduled off (three weeks in advance) to get a root canal. Granted, he wasn't super-thrilled about facing either scenario, but make no mistake: He'd rather be facing the pain that came with anesthetic.
Roger navigated past row after row of cubicles until he reached the large conference room area. The room usually contained only a simple oak table and chairs for 12, but Roger found it filled with at least twice that number of people, with several more spilling out into the halls. The ones who did have a spot at the table seemed to be in the midst of a very loud and heated teleconference.
"Roger! Here, pull up a chair if you can find one. Glad you could make it in."
Upon hearing this, someone at the table not invested in the phone debate offered up their seat and headed for the hallway. By this point in the morning, it was a good 10 degrees cooler in the hallway, owing to the number of laptops in use in the conference room.
"Hey, Jim, what's this all about? I just saw something in the invite about the field auditing app being down and ..."
Jim cut in, "Hang on. I'll show you."
Indisputable Proof
In a flash, Jim switched to a remote desktop session with a single app in focus. On his 15-inch laptop screen, the UI looked like it was designed for someone who had terminal farsightedness, but in reality, it was perfect for the auditors in the field who'd be using it on a screen a third that size.
Each of their handheld computers ran a minimal version of Windows 7 from a flash card. It booted up in seconds and was designed to be pretty much indestructible. The process worked like this: the auditor would receive a scanning assignment, look for a given piece of inventory, and scan the item's barcode tag using the device's attached scanning gun. The app would read that data in, parse out the important fields and store them locally to be uploaded later to the central database. The user would then receive new scanning instructions when the entire process was complete.
Unlike many of their competitors, AuditTech owned the entire end-to-end solution: hardware, software, and the process were all created and curated in-house, so when one step went sideways, it was no problem to get everybody together in the same room. The downside was that not everybody got along under those conditions.
Jim showed Roger A. the screen and shouted, "See? Two rows. Some of the guys in the field reported that they only got one back, if any! One guy got 'no assignments found' and thought it was his cue to punch out and go home -- in the middle of a fugduggering multi-warehouse site audit for crying out loud!!"
Jim caught himself. "Sorry ... it's been pretty messed up. Ever since the inventory work dataload at 5 a.m. today, nothing's been right -- and, of course, they're on-site at the Parts Unlimited facility."
Roger's eyes widened. He suddenly realized why this was such a big deal. Parts Unlimited was his company's biggest and busiest customer. For it, time was truly money -- every day over the inventory audit deadline was $100,000.
Being the back-end systems project lead, Roger had a feeling this storm was about to head in his direction. In an uncharacteristic move for the people gathered in the room, Roger excused himself and made a beeline for his group.
Drop Everything
Roger skipped the normal social niceties and ended the meeting going on in his area's comparatively anemic conference room, then called in his team.
Once everybody was gathered, he explained the situation and repeated the steps Jim had performed just a short time earlier, showing how the auditors' inventorying application wasn't returning anywhere near as many rows as it should.
"So, ladies and gentlemen, now that you can see what's going on, who can give me some ideas as to how we can remedy this?"
Nearly everybody had a suggestion to offer. "Did you try rebooting the SQL box?" "Did someone check the IDSPISPOPD or IDCLIP processes?" "Are the handhelds up-to-date?" These, plus different variations, came out. They were all good suggestions, but no real answers were forthcoming.
Then, one of the devs offered up what seemed to be the most useful idea: "Hey, maybe there's something in the code?"
Roger couldn't tell by his tone if the guy was being sarcastic or actually trying to be helpful, but either way, seconds later, his laptop was driving the meeting with an open view into the corporate Team Foundation Server (TFS) repository.
Fifth Time's the Charm
"Ok, so after the application passes a query to ReadDataBase to execute -- doesn't matter what it is -- it just runs the thing."
private int ReadDataBase(string Query)
{
    try
    {
        for (int i = 0; i < 5; i++)
        {
            if (DatabaseRead(Query) == 0)
                return 0;
            DateTime dt = DateTime.Now;
            while (DateTime.Now.Ticks - dt.Ticks < 6000000)
            {
                Application.DoEvents();
            }
        }
        return -1;
    }
    catch (Exception E)
    {
        Log("Detail : " + E.ToString());
        return -1;
    }
}
private int DatabaseRead(string Query)
{
    try
    {
        // Opening connection
        MyMutDB.WaitOne();
        if (string.IsNullOrEmpty(ConnectionString)) throw new Exception("Invalid Connection string");
        if (OleDBConnection == null)
        {
            OleDBConnection = new OleDbConnection();
            OleDBConnection.ConnectionString = ConnectionString;
        }
        OleDBConnection.ResetState();
        if (OleDBConnection.State != ConnectionState.Open)
        {
            OleDBConnection.Open();
        }
    }
    catch (System.Data.OleDb.OleDbException E)
    {
        MyMutDB.ReleaseMutex();
        Log("Detail : " + E.ToString());
        OleDBConnection = null;
        return -1;
    }
    catch (Exception E)
    {
        MyMutDB.ReleaseMutex();
        Log("Detail : " + E.ToString());
        OleDBConnection = null;
        return -1;
    }
    // Actually querying the database
    try
    {
        if (DR != null)
            DR.Close();
        CommandeOleDB = new OleDbCommand(Query, OleDBConnection);
        DR = CommandeOleDB.ExecuteReader();
        MyMutDB.ReleaseMutex();
        return 0;
    }
    catch (System.Data.OleDb.OleDbException E)
    {
        MyMutDB.ReleaseMutex();
        Log("Detail : " + E.ToString());
        Log("query : -" + Query + "-");
        DR.Close();
        OleDBConnection.Close();
        DR = null;
        OleDBConnection = null;
        return -1;
    }
    catch (Exception E)
    {
        MyMutDB.ReleaseMutex();
        Log("Detail : " + E.ToString());
        Log("query : -" + Query + "-");
        DR.Close();
        OleDBConnection.Close();
        DR = null;
        OleDBConnection = null;
        return -1;
    }
}
"... return from DatabaseRead, and once that's over and done, kicks off a while loop of DoEvents that calculates an auditor's new assignments and ..."
"Hold on," Roger piped up, "Six million ticks? What's that?"
"I'll probably get this wrong, but I think there are 10,000 ticks in a millisecond ... so ... a little more than half a second."
"So, if I understand correctly, it'll run and DoEvents will calculate as many new scan events as it can in three seconds ... one-half second at a time?"
The developer nodded. "Oh, but that's the beauty of it. The process is so fast, we need to wait only that long, or else the auditor will have too much to scan at a given time."
Someone in the room added, "Yeah! It was part of the original spec."
"But why does it run five times?"
The developer shrugged. "Beats me, maybe it ran a few more times to get what someone figured was the perfect number of records. Of course, if the database call timed out you'd at least get something to scan if it was past the first loop iteration ..."
The developer kept speculating, and others joined in, but Roger had heard enough. He excused himself back to the crisis meeting and suggested that the data load from earlier in the day had bogged down the back-end process such that it was timing out.
The representatives from the business analyst and DBA teams hatched a plan to cut the list of inventory to scan in half, then replace it with the other half once the first was fully scanned. It also turned out that this was the largest scan to date -- the fact that anything had worked at all before was probably due to some miracle.
The solution sped things up considerably, and AuditTech beat the deadline and dodged the fine; the code itself could be fixed once the chaos died down. Meanwhile, Roger rescheduled his dental appointment for a time between major inventory audits.
About the Author
Mark Bowytz is a contributor to the popular Web site The Daily WTF. He has more than a decade of IT experience and is currently a systems analyst for PPG Industries.