Livermore Lab Pioneers Debugging Tool

How do you find a bug in a program when that program is spread across 200,000 processors?

As incredible as that scenario might sound, it is becoming a routine problem for Lawrence Livermore National Laboratory, home of the 212,992-core BlueGene/L supercomputer. To help spot bugs, laboratory researchers, along with those from the University of Wisconsin, have developed a new software program, named Stack Trace Analysis Tool (STAT).

"What we are finding is that today's architectures require novel [debugging] techniques," said Lawrence Livermore researcher Gregory Lee, who presented a paper about the new software at the recent SC08 conference in Austin.

Such debugging may become more crucial in years to come, as the largest petascale systems might soon consist of at least 1 million cores.

Lee noted that many full-featured debuggers for parallel programs are already on the market, such as TotalView Technologies' TotalView. However, these tools do not scale well to programs that run across thousands of processors because they cannot complete their analyses within a reasonable amount of time. Their thoroughness slows them down when they work on too many processors: the data structures they create grow too unwieldy.

"Even if your tool works with today's scales, if you take that same application and add one or two orders of magnitude, then some of the things you do now may not work well," Lee said.

The open-source STAT is not a full-featured debugger. Instead, it narrows down the problem area within a large parallel program, so that developers can then use a more thorough commercial debugger to track down and fix the problem.

"We wanted to develop lightweight tools that would help the heavyweight tools by identifying processes that behave in a similar fashion," Lee said.

STAT takes advantage of the fact that most parallel applications run similar processes across multiple nodes. Most debuggers can show each and every process. When a program spans thousands of processors, however, it would be too difficult for the developer to sort through all those processes, even if the debugger could generate all that information in a reasonable amount of time.

STAT works by collapsing identical processes into a single visual representation. The software gathers information about all the running processes and then merges it into a tree graph. It also offers the option of building a 3-D tree graph, which can show the program running over a period of time. Both views are good at exposing problems in unstable programs, such as deadlocks.
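To make the merging idea concrete, here is a minimal Python sketch. It is not STAT's actual code: it assumes each process's stack trace is available as a simple list of function names (outermost call first), and the names `TraceNode`, `merge_traces` and `render`, as well as the sample traces, are hypothetical and chosen only for illustration. Processes that share the same call path collapse into the same branch of the tree, so a developer sees a handful of groups rather than thousands of individual processes.

```python
class TraceNode:
    """One function frame in the merged call tree (hypothetical structure)."""

    def __init__(self, frame):
        self.frame = frame
        self.ranks = set()    # processes whose stack passes through this frame
        self.children = {}    # frame name -> TraceNode


def merge_traces(traces):
    """Merge per-process stack traces into a single prefix tree.

    `traces` maps a process rank to its list of frames, outermost first.
    """
    root = TraceNode("<root>")
    for rank, frames in traces.items():
        node = root
        node.ranks.add(rank)
        for frame in frames:
            if frame not in node.children:
                node.children[frame] = TraceNode(frame)
            node = node.children[frame]
            node.ranks.add(rank)
    return root


def render(node, depth=0):
    """Return indented text lines, one per frame, with the process count."""
    lines = []
    for child in sorted(node.children.values(), key=lambda n: n.frame):
        lines.append(f"{'  ' * depth}{child.frame} [{len(child.ranks)} procs]")
        lines.extend(render(child, depth + 1))
    return lines


# Made-up sample data: three processes wait at a barrier while one is
# still blocked in a receive, a typical shape for a hang.
traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "MPI_Barrier"],
    3: ["main", "solve", "compute", "MPI_Recv"],
}

tree = merge_traces(traces)
print("\n".join(render(tree)))
```

In this toy case the four traces reduce to two branches under `solve`, and the odd process out (rank 3) stands out at once. That is the kind of lead a developer could then hand to a full-featured debugger attached to just that process.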

In one test using BlueGene/L, the research team was able to merge all 212,992 processes of a program into a single tree graph in about a third of a second. "If you interpolate those results to a machine with 1 million cores, you're still talking about latencies that are tolerable," Lee said.

The Lawrence Livermore BlueGene/L support team has just installed STAT for production debugging use, Lee said. Users can deploy STAT alongside the laboratory's copy of TotalView to locate and fix bugs in their code. "We ran it on a couple of real end-cases," Lee said.

About the Author

Joab Jackson is the chief technology editor of Government Computing News (GCN.com).
