You can now follow me on Twitter at this address:
http://www.twitter.com/MohamedFAhmed
Hopefully I'll be able to do it often enough!
Cheers
Wednesday, November 11, 2009
Wednesday, November 4, 2009
Challenges of Multi- and Many-Core Microprocessors
Multi- and many-core processors are here to stay for a long time. They are microprocessor manufacturers' response to the scalability walls of uni-core designs. Although software communities explored parallel programming models heavily in the 80s and 90s, those efforts targeted coarser-grained systems, mainly clusters, parallel machines, and Symmetric Multi-Processors (SMPs). Multi- and many-core architectures brought some classical problems back to the surface, such as memory latency, data synchronization, and thread management. They also introduced new problems of massively parallel systems with very fine-grained threading models, such as managing thousands of concurrent threads and handling inter-thread communication and data sharing. In this post I would like to pinpoint some of these challenges from my research and programming experience on multi-core architectures.
Maintaining the current rate of increase in processing power requires microprocessor designers to introduce more processing cores per chip. However, two important aspects must be considered as more cores are introduced. First, the number of cores will grow fast enough to double processing speed every 18 months and keep Moore's law in effect. Hence, within five years we can expect many-core processors with tens or even hundreds of processing cores on the same chip. Second, as the number of cores increases, the cores themselves will have simpler designs handling simpler tasks, and each core will be faster than current single-core processors. Power and heat management issues will impose such design constraints on microprocessor manufacturers. These design aspects will increase the overall processor's speed while maintaining reasonable power consumption and heat dissipation. As a result, parallelism will become finer grained. Developers will parallelize their applications at a finer-grained level to take full advantage of multi- and many-core advancements. This granularity will increase contention among threads over shared resources, whether a memory location, i.e. data, or an I/O device, and interdependencies among threads will increase as well. In addition, as cores get simpler and faster, more data will move back and forth between the processor and the system's main memory, while the relative cost of memory latency keeps increasing. Hardware-based techniques to hide this latency, such as branch prediction and embedded cache-replacement algorithms, may not be efficient enough for parallel applications. Software-based cache management and execution scheduling are now vital to fully utilize multi-core processors. Finally, the programming complexity of multi-core processors and the inherent complexity of parallel applications require tools to reduce some of these complexities.
Memory Latency Wall
As processors and programs become more parallel, they become more data hungry. Meanwhile, the number of processor cycles needed to access the system's main memory grew from a few cycles in 1980 to almost a thousand cycles today. Moreover, the cache-per-core ratio will continue to go down, which will make the memory latency problem worse if the cache is not managed properly. Although there is great potential in DRAM-based memory to increase performance, the aggregate cycle count of processors will continue to grow faster. The processor-to-memory performance gap is expected to grow by 50% per year according to some estimates. The good news is that the memory latency problem can be attacked with efficient software-based scheduling of memory accesses. Multi-core processors are now returning some control to the software developer to manage each core's cache. Such explicit cache management capabilities give programmers more room to maneuver around the processor-to-memory performance gap.
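To make the idea of software-based latency hiding concrete, here is a minimal sketch in C++ (using GCC's __builtin_prefetch; the prefetch distance is an illustrative assumption that would need tuning per platform). The point is simply to start fetching future data while computing on current data:

#include <cstddef>
#include <vector>

// Sum an array while prefetching elements a fixed distance ahead,
// so memory fetches overlap with computation on already-cached data.
double prefetched_sum(const std::vector<double>& data) {
    const std::size_t kPrefetchDistance = 64;  // in elements; tune per platform (assumption)
    double sum = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i + kPrefetchDistance < data.size())
            // Hint the hardware to start loading a future cache line now.
            __builtin_prefetch(&data[i + kPrefetchDistance], 0 /*read*/, 1 /*low locality*/);
        sum += data[i];  // compute on data that is (hopefully) already in cache
    }
    return sum;
}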
Data Synchronization
Using current synchronization mechanisms to coordinate tens or hundreds of threads accessing one resource may put an application's performance on the line. The whole system or application may suffer from deadlock or starvation due to weak synchronization mechanisms. In the worst cases, current synchronization mechanisms serialize the application in areas that need access to shared resources. As the number of hardware threads in multi-core processors grows and parallelism in applications increases, the resulting performance loss increases as well. For example, implementing a parallel shared counter requires each participating thread to lock the counter before incrementing it. In the worst case, each thread has to wait for n-1 threads before it can update the counter, where n is the number of threads. If the cores are on the same die, the efficiency of data synchronization can be greatly enhanced by communicating through available on-chip facilities, such as the cores' interconnect or the shared cache. Current synchronization techniques use the system's main memory to write and read shared data, which makes things even worse: they add the memory latency delay on top of the delay of the synchronization algorithm itself.
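As a concrete illustration of the shared-counter example, here is a small C++11 sketch (the thread and iteration counts are arbitrary choices of mine). The mutex version serializes every increment exactly as described above, while the atomic version pushes the contention down to the cache-coherence protocol:

#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::mutex counter_mutex;
long locked_counter = 0;               // every increment serializes behind the lock
std::atomic<long> atomic_counter{0};   // contention resolved by the coherence protocol

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        {   // worst case: wait behind up to n-1 other threads holding the lock
            std::lock_guard<std::mutex> guard(counter_mutex);
            ++locked_counter;
        }
        atomic_counter.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)            // 8 threads: an illustrative choice
        threads.emplace_back(worker, 100000);
    for (auto& th : threads) th.join();
    return 0;
}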
Programming Complexity
Parallel computing is inherently complex, mainly due to the difficulty of design and the intricacy of resource sharing and synchronization. The presence of multi-core processors at different scales, from embedded systems to supercomputers, makes adapting applications to these new hardware platforms a critical issue. As multi-core processors gain more cores and parallelism becomes more fine-grained, complexity will increase as well. Instead of designing a parallel application with 10 or 20 concurrent threads, an application may be executing hundreds or thousands of threads on the same machine. A solution is required here to reduce programming complexity while scaling well with the number of working threads; a task-based sketch of this idea follows.
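One common way to decouple logical parallelism from the physical thread count is a fixed pool of workers draining a queue of fine-grained tasks. Below is a minimal C++11 sketch of that idea (my own simplified construction, not a production scheduler): thousands of tiny tasks can be submitted while the number of OS threads stays bounded.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A minimal task pool: many fine-grained tasks, few OS threads.
// Usage: TaskPool pool(4); for (int i = 0; i < 10000; ++i) pool.submit([]{ /* work */ });
class TaskPool {
public:
    explicit TaskPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> g(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
    ~TaskPool() {  // drain remaining tasks, then stop the workers
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // done_ set and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // execute one fine-grained unit of work
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};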
These challenges are the main inspiration for most multi-core researchers, architects, and developers. All microprocessor manufacturers are after faster processors that do not increase programming complexity and do not cost developers the ability to make the best use of the new architectures. That is why microprocessor manufacturers are now aggressively involved in programming models. Intel, for example, created Intel Parallel Studio (with the Ct framework) for its general-purpose multi-core microprocessors as well as specialized ones, such as the Larrabee GPGPU. ATI built the ATI Stream framework, and NVIDIA built the CUDA framework, to help developers get the best out of these new microprocessors without getting into the nitty-gritty architectural details of these advanced GPGPUs.
Monday, November 2, 2009
Multi-Core Programming Experiences
In my previous post I tried to characterize multi-core processors and quickly pinpoint their major distinguishing features. In this post I would like to briefly share with you some of the untapped aspects of the Cell Broadband Engine and the GeForce GTX 295.
The Cell Broadband Engine (CBE)
The Cell microprocessor is an innovative heterogeneous multi-core processor built by IBM, Sony, and Toshiba. Four hundred designers worked closely together for five years to invent the new heart of the PlayStation 3 (PS3). The PS3 was only a starting point for the CBE; IBM took the first stab at re-introducing the CBE as a highly capable microprocessor for compute-intensive jobs.
I won't repeat the explanation of the CBE architecture; you can find it on Wikipedia, IBM's website, and many other places if you Google it. I implemented several discrete algorithms on it, such as graph traversals and integer sorting, and scientific algorithms such as the FFT. I was also able to get into its architectural details and build a proprietary threading model, called micro-threading. This framework hides memory latency in most workloads without engaging the developer in the lowest architectural details of the Cell processor. I also ran a series of experiments to characterize the effect of its architectural properties on memory latency patterns in different workloads.
The beauty of the CBE, from my point of view, is the great extent of control it gives you to reach the best execution time. All the other architectural properties regarding its compute capabilities, multiple levels of parallelism, and its on-chip network are discussed in many publications, and I experienced all these great features during my PhD journey. However, I would like to note the following from some of my experiences.
Although the Element Interconnect Bus (EIB) offers high performance and very low latency, the effect of its topology on core performance is overlooked by most research efforts. For example, in my memory latency measurement experiments I found that memory latency differs from one core to another depending on how far that core is from the memory controller. This does not reflect a flaw in the design, but it is a property that changes the memory latency measured at each core. As this topology is adopted more widely and more cores are connected to the same ring, the effect of physical location will grow in importance. The question is: how does this affect performance if I can use techniques such as multi-buffering and data prefetching? The answer is simple: since memory latency differs from core to core, you need to hide it according to each core's latency measurements. For example, on cores with relatively high memory latency you can use more buffers to prefetch your data compared to cores with lower latency; see the sketch below. I already discussed this in my paper on optimum micro-threading scheduling; please have a quick look at it to understand this issue further.
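Here is a minimal double-buffering sketch of that idea in portable C++. The transfer primitives are synchronous stubs standing in for the platform's asynchronous DMA API (on the Cell SPE this would be mfc_get and mfc_read_tag_status); on real hardware the fetch returns immediately and overlaps with computation, and cores with higher measured latency would simply rotate over more buffers:

#include <cstddef>
#include <cstring>

// Stand-ins for the platform's asynchronous transfer API. These are
// synchronous stubs so the sketch compiles and runs anywhere.
static void start_async_fetch(float* dst, const float* src, std::size_t n, int /*tag*/) {
    std::memcpy(dst, src, n * sizeof(float));  // real version returns immediately
}
static void wait_fetch(int /*tag*/) {}         // real version blocks until the DMA tag completes

static void process(float* chunk, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) chunk[i] *= 2.0f;  // placeholder work
}

// Double buffering: fetch chunk i+1 while computing on chunk i, so the
// transfer latency hides behind useful work.
void process_stream(const float* src, std::size_t chunks, std::size_t chunk_len) {
    static float buf[2][4096];                 // chunk_len must be <= 4096
    start_async_fetch(buf[0], src, chunk_len, 0);
    for (std::size_t i = 0; i < chunks; ++i) {
        std::size_t cur = i % 2, nxt = (i + 1) % 2;
        if (i + 1 < chunks)                    // kick off the next transfer early
            start_async_fetch(buf[nxt], src + (i + 1) * chunk_len, chunk_len, (int)nxt);
        wait_fetch((int)cur);                  // block only if the fetch is still in flight
        process(buf[cur], chunk_len);
    }
}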
Also, the new implementation of the CBE can now work with larger RAM. However, this RAM is based on DDR3 technology; the initial implementation, using XDR RAM, was limited to a maximum of 1 GB of memory. Although the new CBE implementation has a faster floating-point unit and two memory controllers to keep processor-memory bandwidth high, memory latency is getting higher due to the higher latency of DDR3 RAM compared to the XDR implementation.
NVIDIA GeForce GPGPU
I also worked on the NVIDIA GeForce GTX 295 to implement some information retrieval algorithms. It is still work in progress and yet to be published. However, building on my insights from the Cell Broadband Engine, I noticed some architectural properties that are worth sharing with you.
NVIDIA's GPU programming framework abstracts the architectural details of the microprocessor to a great extent. From a productivity point of view, this is a great feature. However, it leaves few options for researchers and experienced developers to explore what happens inside. For example, I couldn't find a clear way to schedule threads onto the processor's cores; it appears to be done by the processor's scheduler, though I'm not sure yet, following the same policy used by Intel's hyper-threading and Sun's multi-threading architectures. I'm now measuring whether memory latency differs from one core to another; my initial measurements show that it is almost the same across all cores (see the sketch below for the general approach). However, I'm a little concerned about the relatively many hierarchies built into the processor. I realize that the shared memory model mandates such hierarchy to keep synchronization overhead reasonable, but adding hierarchies to run away from this problem is not the best solution. NVIDIA is still investing in this hierarchical approach with their new GPU processor, Fermi.
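For reference, the classic way to measure per-core memory latency on the host side is a pointer chase over a single random cycle, pinned to one core at a time. The sketch below is a simplified CPU version of that technique (Linux-specific affinity call; the array size and core count are illustrative assumptions), not the exact GPU methodology:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>
#include <pthread.h>

// Pin the calling thread to one core so the measurement reflects that
// core's path to memory (Linux-specific; compile with g++ -O2 -pthread).
static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static double measure_latency_ns(std::size_t n, std::size_t steps) {
    // Link all n slots into one random cycle: each load depends on the
    // previous one, so the loop time approximates raw memory latency.
    std::vector<std::size_t> order(n), next(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin() + 1, order.end(), std::mt19937_64{42});
    for (std::size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];

    volatile std::size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    const int kCores = 4;  // illustrative; query the system in practice
    for (int core = 0; core < kCores; ++core) {
        pin_to_core(core);
        // 16M entries (~128 MB) to defeat the caches.
        std::printf("core %d: %.1f ns per dependent load\n", core,
                    measure_latency_ns(std::size_t(1) << 24, 10000000));
    }
    return 0;
}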
Although multi-core processors provide a smart escape from the physical limitations of uni-core processors, they need thorough architectural analysis to make the best use of their resources. I think this is attainable by monitoring different execution patterns and pinpointing bottlenecks. This should make it easier to build efficient programming models, run-time systems, and algorithms for multi- and many-core microprocessors. For example, inside the Cell Broadband Engine, manual cache management and the Element Interconnect Bus (EIB) created the opportunity to build run-time systems that get the best performance while simplifying the programming model, such as micro-threading, the MPI microtask, and data prefetching. I think multi- and many-core processors will evolve through a closed feedback loop, as described below.
Whenever a new architecture is introduced, developers and researchers start implementing different algorithms and applications to get the best out of it. However, performance bottlenecks pop to the surface very soon. Several efforts then try to solve these bottlenecks, either by tweaking the implemented algorithms or by building general frameworks that identify performance-degrading parameters at run time and change them. The loop is properly closed when microprocessor designers listen to developers' notes and hide these bottlenecks through architectural enhancements. I think this was handled properly in NVIDIA's new Fermi architecture. For example, Fermi now has multiple memory controllers, each handling requests from a different group, which should reduce the effect of memory request serialization, a serious performance bottleneck.
I believe research teams are now moving from the naive ways of speeding up multi- and many-core processors, the old tricks of algorithmic enhancement, toward digging into the processor's architectural features and proposing better programming models backed by run-time libraries and frameworks. This trend is blending compilers, operating systems, parallel programming, and microprocessor architecture together.