Livermore Lab Pioneers Debugging Tool

How do you find a bug in a program when that program is spread across 200,000 processors?

As incredible as that scenario might sound, it is becoming a routine problem for Lawrence Livermore National Laboratory, home of the 212,992-core BlueGene/L supercomputer. To help spot bugs, laboratory researchers, along with those from the University of Wisconsin, have developed a new software program, named Stack Trace Analysis Tool (STAT).

"What we are finding is that today's architectures require novel [debugging] techniques," said Lawrence Livermore researcher Gregory Lee, who presented a paper about the new software at the recent SC08 conference in Austin.

Such debugging may become more crucial in years to come, as the largest petascale systems might soon consist of at least 1 million cores.

Lee noted that many full-featured debuggers for parallel programs are already on the market, such as TotalView Technologies' TotalView. But such tools do not scale well to programs that run across thousands of processors because they cannot complete their analyses within a reasonable amount of time. Their thoroughness slows them down when they work across too many processors: the data structures they create grow too unwieldy.

"Even if your tool works with today's scales, if you take that same application and add one or two orders of magnitude, then some of the things you do now may not work well," Lee said.

The open-source STAT is not a full-featured debugger. Instead, it narrows down the problem area within a large parallel program, so that a more thorough commercial debugger can then be used to diagnose and fix the problem.

"We wanted to develop lightweight tools that would help the heavyweight tools by identifying processes that behave in a similar fashion," Lee said.

STAT takes advantage of the fact that most parallel applications run similar processes across multiple nodes. Most debuggers show each process individually. With thousands of processors, sorting through all those processes would be too difficult for the developer, even if the debugger could generate the information in a reasonable amount of time.

STAT works by collapsing identical processes into a single visual representation. The software gathers information about all the running processes and then merges it into a tree graph. It also offers the option of building a 3-D graph tree, which can show the program running over a period of time. Both approaches are good at locating problems in unstable programs, such as deadlocks.
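
The merging idea can be illustrated with a small Python sketch. This is a single-process toy, not STAT's actual implementation: the function names, the data layout and the sample traces are all hypothetical, and it assumes each process's stack trace is already available as a list of function names, outermost call first. Processes that share a call path collapse into the same branch of the tree, so a lone process stuck somewhere unusual stands out.

```python
def merge_stack_traces(traces):
    """Merge per-process stack traces into one call prefix tree.

    traces: dict mapping a process rank to its list of frame names,
            outermost frame first (hypothetical input format).
    Returns a nested dict: frame -> {"ranks": set, "children": {...}}.
    """
    root = {}
    for rank, frames in traces.items():
        level = root
        for frame in frames:
            node = level.setdefault(frame, {"ranks": set(), "children": {}})
            node["ranks"].add(rank)  # this rank passed through this frame
            level = node["children"]
    return root


def leaf_groups(tree, path=()):
    """Yield (call path, ranks) for the ranks whose trace ends at that path."""
    for frame, node in tree.items():
        here = path + (frame,)
        deeper = set()
        for child in node["children"].values():
            deeper |= child["ranks"]
        ending = node["ranks"] - deeper  # ranks that go no deeper than here
        if ending:
            yield here, sorted(ending)
        yield from leaf_groups(node["children"], here)


# Made-up traces: three ranks wait at a barrier, one is stuck in a receive.
traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "MPI_Barrier"],
    3: ["main", "solve", "compute", "MPI_Recv"],
}

tree = merge_stack_traces(traces)
groups = list(leaf_groups(tree))
for call_path, ranks in groups:
    print(" > ".join(call_path), "ranks:", ranks)
```

In this toy example, four traces reduce to two groups, and rank 3 is the outlier a developer would then inspect with a full-featured debugger.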

In one test using BlueGene/L, the research team was able to merge all 212,992 processes of a program into a single graph tree in about a third of a second. "If you interpolate those results to a machine with 1 million cores, you're still talking about latencies that are tolerable," Lee said.

The Lawrence Livermore BlueGene/L support team has just installed STAT for production debugging use, Lee said. Users can deploy STAT alongside the laboratory’s copy of TotalView to locate and fix bugs. "We ran it on a couple of real end-cases," Lee said.

About the Author

Joab Jackson is the chief technology editor of Government Computing News (GCN.com).
