Livermore Lab Pioneers Debugging Tool

How do you find a bug in a program when that program is spread across 200,000 processors?

As incredible as that scenario might sound, it is becoming a routine problem for Lawrence Livermore National Laboratory, home of the 212,992-core BlueGene/L supercomputer. To help spot bugs, laboratory researchers, along with those from the University of Wisconsin, have developed a new software program, named Stack Trace Analysis Tool (STAT).

"What we are finding is that today's architectures require novel [debugging] techniques," said Lawrence Livermore researcher Gregory Lee, who presented a paper about the new software at the recent SC08 conference in Austin.

Such debugging may become more crucial in years to come, as the largest petascale systems might soon consist of at least 1 million cores.

Lee noted that many full-featured debuggers for parallel programs are already on the market, such as TotalView Technologies' TotalView. But such tools do not scale well to programs that run across thousands of processors, because they cannot complete their analyses within a reasonable amount of time. Their thoroughness slows them down as the processor count grows -- the data structures they create become too unwieldy.

"Even if your tool works with today's scales, if you take that same application and add one or two orders of magnitude, then some of the things you do now may not work well," Lee said.

The open-source STAT is not a full-featured debugger. Instead, it narrows down the problem area within a large parallel program, so that developers can then apply a more thorough commercial debugger to that smaller set of processes.

"We wanted to develop lightweight tools that would help the heavyweight tools by identifying processes that behave in a similar fashion," Lee said.

STAT takes advantage of the fact that most parallel applications run similar processes across multiple nodes. Most debuggers show each process individually. With thousands of processors, a developer could not realistically sort through all of those processes, even if the debugger could generate that much information in a reasonable amount of time.

STAT works by collapsing identical processes into a single visual representation. The software program gathers information about all the processes running and then merges them into a tree graph. It also offers the option of building a 3-D tree graph, which can show the program running over a period of time. Both approaches are good at locating problems in unstable programs, such as deadlocks.
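The merging idea can be illustrated with a minimal Python sketch. This is not STAT's actual code or API: the `Node` class, the `merge_traces` and `render` helpers, and the sample traces are all hypothetical, and the sketch simply assumes each process reports its call stack as a list of function names, outermost call first. Processes with the same call path share tree nodes, so an outlier shows up as its own small branch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One call-stack frame, plus the set of process ranks that reached it."""
    name: str
    ranks: set = field(default_factory=set)
    children: dict = field(default_factory=dict)


def merge_traces(traces):
    """Merge per-process stack traces (rank -> list of frames) into one prefix tree."""
    root = Node("<root>")
    for rank, frames in traces.items():
        node = root
        node.ranks.add(rank)
        for frame in frames:
            # Reuse the child for this frame if another process already created it.
            node = node.children.setdefault(frame, Node(frame))
            node.ranks.add(rank)
    return root


def render(node, depth=0):
    """Return indented text lines showing each frame and how many processes share it."""
    lines = [f"{'  ' * depth}{node.name} [{len(node.ranks)} procs]"]
    for name in sorted(node.children):
        lines.extend(render(node.children[name], depth + 1))
    return lines


# Hypothetical sample: eight processes, seven waiting at a barrier, one stuck elsewhere.
traces = {rank: ["main", "solve", "MPI_Barrier"] for rank in range(8)}
traces[3] = ["main", "solve", "compute", "MPI_Recv"]

tree = merge_traces(traces)
print("\n".join(render(tree)))
```

In this toy case the seven processes waiting at the barrier collapse into one branch, while rank 3 stands alone under `compute`, which is the kind of small group a developer might then inspect with a full-featured debugger.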

In one test using BlueGene/L, the research team was able to merge all 212,992 processes of a program into a single graph tree in about a third of a second. "If you interpolate those results to a machine with 1 million cores, you're still talking about latencies that are tolerable," Lee said.
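A rough back-of-the-envelope sketch shows what that estimate might look like. The article does not say how STAT's merge time grows with process count, so the two growth models below (linear and logarithmic) are assumptions chosen only to bracket the figure, and the function names are illustrative.

```python
import math

MEASURED_PROCS = 212_992   # process count in the BlueGene/L test
MEASURED_SECONDS = 1 / 3   # "about a third of a second"
TARGET_PROCS = 1_000_000   # the 1 million cores Lee mentions


def extrapolate_linear(target):
    """Assumption: merge time grows in proportion to the process count."""
    return MEASURED_SECONDS * target / MEASURED_PROCS


def extrapolate_log(target):
    """Assumption: merge time grows with the logarithm of the process count."""
    return MEASURED_SECONDS * math.log(target) / math.log(MEASURED_PROCS)


linear_estimate = extrapolate_linear(TARGET_PROCS)
log_estimate = extrapolate_log(TARGET_PROCS)
print(f"linear growth:      {linear_estimate:.2f} s")
print(f"logarithmic growth: {log_estimate:.2f} s")
```

Under the linear assumption the merge would take roughly 1.6 seconds at 1 million processes, and under the logarithmic assumption about 0.4 seconds. Either way the estimate stays within a couple of seconds, which fits Lee's description of the latency as tolerable.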

The Lawrence Livermore BlueGene/L support team has just installed STAT for production debugging use, Lee said. Users can deploy STAT alongside the laboratory's copy of TotalView to track down and fix bugs. "We ran it on a couple of real end-cases," Lee said.

About the Author

Joab Jackson is the chief technology editor of Government Computing News (GCN.com).
