Running on All Cores
The 1-2-3s of threading your apps for Windows.
Multicore is here to stay, but as exciting as parallelism is, we need to ask ourselves practical questions: "How are we going to use it?" and "What should I do about it?"
Anyone programming on a Web server is already writing to a model where individual requests are served in parallel. This transactional (process-based) parallelism is very effective. While we may one day work to speed up individual request responses, it's difficult today to find examples where it matters enough to bother. Servers make good use of multicore processors, and will for some time.
Process-based parallelism comes from running multiple, generally independent processes at once. If you want a single process to run faster on multicore processors, you need to break the process up into threads. Each thread runs independently but shares memory with other threads.
The operating system schedules threads essentially like processes, but they are understood to be part of one program; all threads in a process share the same address space.
Support for threading is still evolving. An application can spawn so many threads that the memory overhead of keeping track of them all becomes overwhelming. For Windows, that number is in the few thousands -- far more than the number of processors you would see in a Windows system. Microsoft upgraded its thread pool API in Windows Vista to be simpler, more flexible and higher performance.
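Deciding how to decompose your problem to use parallelism starts with one simple question: Is your program data-parallel or task-parallel? A data-parallel program applies the same operation to many pieces of data at once; a task-parallel program runs different operations at the same time.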
Programs can use both styles -- sometimes at once -- to smoothly distribute the work among processor cores. Deciding which to use is the programmer's job.
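To make the two styles concrete, here is a minimal sketch in standard C++. It assumes std::thread and std::async rather than raw Windows threads, purely for portability, and the names square_all, sum_and_max and Stats are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// Data parallelism: every thread performs the SAME operation (squaring)
// on a DIFFERENT slice of the data.
void square_all(std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        // Slices do not overlap, so no locking is needed.
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i) data[i] *= data[i];
        });
    }
    for (auto& w : workers) w.join();
}

struct Stats {
    long long sum;
    int max;
};

// Task parallelism: two DIFFERENT operations (sum and maximum) run
// concurrently over the same read-only data.
Stats sum_and_max(const std::vector<int>& data) {
    assert(!data.empty());
    auto sum_task = std::async(std::launch::async, [&data] {
        return std::accumulate(data.begin(), data.end(), 0LL);
    });
    auto max_task = std::async(std::launch::async, [&data] {
        return *std::max_element(data.begin(), data.end());
    });
    return Stats{sum_task.get(), max_task.get()};
}
```

A program can combine both: for example, it could call square_all on one buffer while another task computes statistics on a different buffer.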
.NET developers can plan for an easier and more gradual transition by thinking parallel now and learning from the C++ experience as it unfolds. Most C++ developers are using Windows threads directly and relying on the small amount of support basic tools supply.
Developers face several challenges in adding parallelism to their C++ applications, namely scalability, correctness and maintainability.
Figuring out why you do not scale as well as you'd like is all about finding cases where not all threads in the application are running full bore. Good visualization tools, which can show what threads are doing, are a developer's best friend when working on improving scaling.
Amdahl's Law says that, given enough processors, you can speed up the parallel parts of a program until they take essentially zero time, but the serial code you leave in your program will still limit your overall speed-up. The solution is to eliminate as much serial code as possible: run in parallel all the time. This is far easier said than done.
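The usual statement of Amdahl's Law is speedup = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be parallelized and n is the number of processors. The small sketch below simply evaluates that formula; the function names are hypothetical.

```cpp
#include <cassert>

// Speed-up predicted by Amdahl's Law for a program whose parallelizable
// fraction is `parallel_fraction`, run on `processors` processors.
double amdahl_speedup(double parallel_fraction, unsigned processors) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(processors > 0);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / processors);
}

// Upper bound on the speed-up as the processor count grows without limit:
// the parallel part shrinks toward zero time, the serial part remains.
double amdahl_limit(double parallel_fraction) {
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

For a program that is 90 percent parallel, amdahl_speedup(0.9, 8) comes to roughly 4.7, and amdahl_limit(0.9) shows that no number of processors can push it past about 10.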
Frankly, not every algorithm can be coded in parallel. That does not mean no parallel algorithm exists for the task. People who "think parallel" will tell you the parallel algorithm that scales best is seldom the top serial algorithm, and vice-versa. Part of what you learn when you think parallel is to reject algorithms you can't figure out how to scale, in favor of algorithms that may seem inferior in some way, but scale well.
The best tools for scaling are abstractions where someone else did the work -- specifically libraries (which are threaded) or OpenMP (parallelism extensions supported by virtually all C++ compilers, including Intel and Microsoft compilers). You should ask suppliers of libraries you buy about their current support and future plans for parallelism. Select vendors that have a plan or your future will be hampered. If you sell software services, be prepared for your customers to ask about your plans for parallelism.
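As an illustration of how little code an abstraction like OpenMP can require, here is a minimal sketch of a dot product. The function name dot is hypothetical, and the example assumes a compiler with OpenMP enabled (for instance /openmp with Microsoft's compiler or -fopenmp with GCC). Without that switch the pragma is ignored and the loop runs serially, producing the same answer.

```cpp
#include <cassert>
#include <vector>

// Dot product of two equal-length vectors. With OpenMP enabled, the loop
// iterations are divided among threads and each thread's partial sum is
// combined by the reduction clause.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    // A signed loop index is used because older OpenMP versions require it.
    const int n = static_cast<int>(a.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The serial version of this function is identical except for the single pragma line, which is the appeal of the approach: the threading, work division and combining of results are left to the compiler and its runtime.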
Incorrect synchronization of threads can cause problems that affect the successful deployment of parallel programs. Synchronization is generally done with "locks" -- data values in memory that serve as traffic lights indicating which threads can proceed. Generally, a thread locks data when it wants to read, modify and write the value without another thread touching the data. Other threads wait until the lock is released before they read or modify the data.
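A minimal sketch of that read-modify-write pattern follows. It assumes the standard C++ std::mutex in place of a Windows-specific lock, and the names Counter and count_in_parallel are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// A shared counter whose read-modify-write is protected by a lock.
class Counter {
public:
    void increment() {
        std::lock_guard<std::mutex> guard(lock_);  // wait if another thread holds the lock
        ++value_;                                  // read, modify and write as one protected step
    }                                              // guard releases the lock on scope exit

    long value() const {
        std::lock_guard<std::mutex> guard(lock_);
        return value_;
    }

private:
    mutable std::mutex lock_;
    long value_ = 0;
};

// Starts `num_threads` threads that each increment one shared counter
// `increments_per_thread` times, then returns the final total.
long count_in_parallel(unsigned num_threads, long increments_per_thread) {
    assert(increments_per_thread >= 0);
    Counter counter;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) counter.increment();
        });
    }
    for (auto& w : workers) w.join();
    return counter.value();
}
```

With the lock in place, four threads doing 10,000 increments each always total 40,000. If the lock were removed, two threads could read the same old value and one update would be lost, which is exactly the kind of intermittent error that makes these bugs hard to find.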
Problems often arise from locks used along with libraries and recursion. Working with Windows developers on their experiences debugging parallel applications, we've found virtually every project gets something wrong. You need to look for solutions here -- tools to help -- and develop techniques and expertise to avoid issues.
Finally, you want to write code that is easy to read, understand and maintain. This is a key reason why C++ is more popular than assembly language programming.
C++ developers should look to add parallelism in three ways:
- Use threaded libraries and/or OpenMP. Many libraries are threaded to take advantage of parallelism, with widely varying levels of maturity.
- Use other lesser-known and emerging abstractions, such as template-based libraries or Microsoft's research compiler (http://research.microsoft.com/comega). Adding more abstractions needs to be a focus for all tool vendors.
- Explicitly use Windows threads (hand-coding) and manage them directly.
If you can't find enough in the first two approaches today and you end up hand-coding threads, try to use simple models like queuing. Microsoft has support in the operating system for thread pooling with functions like QueueUserWorkItem.
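The queuing model can be sketched as follows. This is a portable stand-in written with standard C++ threads, not the Windows thread pool API; the class name WorkQueue and its submit member are hypothetical. On Windows, QueueUserWorkItem fills the role that submit plays here, with the operating system owning the worker threads.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// A fixed set of worker threads that pull work items from a shared queue.
class WorkQueue {
public:
    explicit WorkQueue(unsigned num_workers) {
        assert(num_workers > 0);
        for (unsigned i = 0; i < num_workers; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Lets the workers finish all queued items, then stops them.
    ~WorkQueue() {
        {
            std::lock_guard<std::mutex> guard(lock_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (auto& w : workers_) w.join();
    }

    WorkQueue(const WorkQueue&) = delete;
    WorkQueue& operator=(const WorkQueue&) = delete;

    // Adds one work item; some worker will pick it up.
    void submit(std::function<void()> item) {
        {
            std::lock_guard<std::mutex> guard(lock_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> item;
            {
                std::unique_lock<std::mutex> guard(lock_);
                ready_.wait(guard, [this] { return stopping_ || !items_.empty(); });
                if (items_.empty()) return;  // only reached once stopping_ is set
                item = std::move(items_.front());
                items_.pop();
            }
            item();  // run the work outside the lock
        }
    }

    std::mutex lock_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> items_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

The appeal of this model is that the only shared state is the queue itself, so all locking lives in one small class, and the number of threads stays fixed no matter how many work items are submitted.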
In the end, thinking parallel is a mindset. It's not harder than serial thinking, just different. Application development will be more difficult because we don't yet have the tools, experience and support we need to make it easy. Experience is what all application developers need more of, so that we learn to think parallel and build successful apps over the next decade.