Simulation May Be Key for Multicore Debugging
I hate race conditions. So do developers of multithreaded applications whose bugs depend on timing both within and between threads. Race conditions go against every instinct that a developer has about computers and software: they are, or at least appear to be, nondeterministic. Whether the bug shows up depends on when a particular thread finishes a particular task, which can vary with the execution path taken or the amount of data read and written. Or it can vary based on the time of day, or phase of the moon.
Just kidding about the last part, but to many developers it can seem like a reasonable statement.
Most of us have encountered bugs in clean builds that mysteriously disappear when debug information is added. This is often a race condition at work: debugging information tends to slow down execution, which changes the timing and hides the problem. Of course, the problem is still there, but it cannot be found while running the debugger.
These were the thoughts that perked me up as I participated in a briefing from Paul McLellan of Virtutech (www.virtutech.com). Virtutech makes software, primarily for embedded systems development, that enables developers to simulate the underlying hardware and operating systems. This is especially important for embedded development, such as cell phone software, where the hardware may not even be ready when application development is in full swing.
But there are lessons here for PC developers, who are increasingly facing problems that can be addressed by debugging on simulated systems. The lessons come from, of all things, debugging software. "You can run the simulator backwards," explained McLellan. "Just get to the point where the bug occurs, then step backwards."
That is an extremely powerful debugging technique, because you do not have to guess where to start stepping forward, as you do with conventional debugging techniques. But it does not end there. "If you have a multicore system," McLellan continued, "you can run one CPU at ten times the clock rate of the other. You can change the timing until the bug disappears."
Of course, this by itself does not find the bug for you, but it does give you one more tool in your arsenal, and it is a tool that is not available when you are debugging on real hardware. Most of us could find good uses for such a tool in practical debugging situations.
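To make the timing idea concrete, here is a toy sketch of my own in Python. It is not Virtutech's product or API. Two simulated "cores" each perform a non-atomic increment on a shared counter, and a hypothetical `run` function lets you set how many steps each core takes per round. All of the names (`Shared`, `increment`, `run`) are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Shared:
    counter: int = 0


def increment(shared, times):
    """A non-atomic increment: read on one tick, write back on the next."""
    for _ in range(times):
        local = shared.counter       # read the shared value
        yield                        # one simulated clock tick passes
        shared.counter = local + 1   # write back a possibly stale value
        yield


def run(speed_a, speed_b, times=5):
    """Interleave two simulated cores.

    Each round, core A takes speed_a steps and then core B takes speed_b
    steps. This is a hypothetical stand-in for changing relative clock rates.
    """
    shared = Shared()
    live = [(increment(shared, times), speed_a),
            (increment(shared, times), speed_b)]
    while live:
        for core in list(live):
            task, speed = core
            for _ in range(speed):
                try:
                    next(task)
                except StopIteration:
                    live.remove(core)   # this core has finished its work
                    break
    return shared.counter


print(run(1, 1))    # lockstep cores lose updates: prints 5, not 10
print(run(10, 1))   # core A runs ten times faster: prints 10
```

With both cores in lockstep, every write clobbers the other core's update and half the increments are lost. Running one core ten times faster lets it finish before the other one interferes, so the symptom vanishes even though the unprotected read-modify-write is still in the code.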
I commented a few weeks ago (http://www.ftponline.com/weblogger/forum.aspx?ID=11&DATE=8/20/2006) that programming multiprocessor systems represents one of the fundamental problems of computer science for the foreseeable future, perhaps requiring new languages or new techniques that are not yet in common practice. (NB – The distinction I am making between multiprocessor and multicore systems is that in a multiprocessor system, the system has multiple complete processors on separate dies, whereas in multicore systems the processor has multiple CPUs on the same die.)
One of those techniques may well be running multiprocessor and multicore application software on a simulated processor and operating system platform. Consider how this might work. You write a threaded application in which different threads are designed to run on different CPU cores. In the course of debugging, you come across an application crash that seems to occur almost randomly, even in different parts of the code.
On a simulated platform, you can do a debug build and engage the debugger. Then let the application run until the crash, and back up the execution of the simulator, watching thread activity and variable values on different cores as you normally would in the debugger, except in reverse. Once you've established that a race condition exists, you can speed up one of the processors until it disappears, confirming the condition and getting a better idea of the timing involved.
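The "back up from the crash" step can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how any real simulator implements reverse execution: a hypothetical `record_trace` function replays a fixed schedule of two cores doing a read-then-write increment and snapshots the counter after every step, and a hypothetical `step_back` function walks that history in reverse to find the most recent write that failed to advance the counter.

```python
def record_trace(schedule):
    """Replay a fixed schedule of core names, snapshotting after every step.

    A core's first turn reads the counter; its next turn writes back
    (value read + 1). Each entry is (step, core, operation, counter).
    """
    counter = 0
    pending = {}    # value each core has read but not yet written back
    history = []
    for step, core in enumerate(schedule):
        if core not in pending:
            pending[core] = counter
            op = "read"
        else:
            counter = pending.pop(core) + 1
            op = "write"
        history.append((step, core, op, counter))
    return history


def step_back(history):
    """Walk backwards from the last step to the most recent lost update."""
    for i in range(len(history) - 1, 0, -1):
        _, _, op, value = history[i]
        before = history[i - 1][3]
        if op == "write" and value <= before:
            return history[i]    # this write did not advance the counter
    return None


trace = record_trace(["A", "B", "A", "B"])   # both cores read before either writes
culprit = step_back(trace)
print(culprit)   # (3, 'B', 'write', 1)
```

Starting at the failure and stepping backwards lands directly on core B's stale write, without guessing where to set a breakpoint. With the schedule `["A", "A", "B", "B"]`, no update is lost and `step_back` returns `None`.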
This will not make the code easier to write, of course. The developer is still responsible for locking resources and data from access and change by multiple threads, and for blocking thread execution when it can be harmful. But I cannot help but think that simulated platforms will play some role in multiprocessor and multicore application development in the future.
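For completeness, here is a minimal Python sketch of the locking that remains the developer's job. It uses the standard `threading.Lock`. The function name `locked_total` and its parameters are my own, chosen for illustration.

```python
import threading


def locked_total(threads=4, increments=10000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:          # only one thread at a time may read-modify-write
                counter += 1

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter


print(locked_total())   # prints 40000
```

Because the lock makes each read-modify-write indivisible, the total no longer depends on how the threads happen to be scheduled.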
Posted by Peter Varhol on 09/18/2006 at 1:15 PM