Monday, November 2, 2009

Multi-Core Programming Experiences



In my previous post I tried to characterize multi-core processors and quickly pinpoint their major distinguishing features. In this post I would like to briefly share with you some of the untapped aspects of the Cell Broadband Engine and the GeForce GTX 295.


The Cell Broadband Engine (CBE)
The Cell microprocessor is an innovative heterogeneous multi-core processor built by IBM, Sony, and Toshiba. 400 designers worked closely together for five years to invent the new heart of the PlayStation 3 (PS3). The PS3 was only a starting point for the CBE; IBM made the first move beyond it, re-introducing the CBE as a highly capable microprocessor for compute-intensive jobs.
I won't repeat the explanation of the CBE architecture; you can find it on Wikipedia, IBM's website, and many other places if you Google it. I implemented several discrete algorithms, such as graph traversal and integer sorting, as well as scientific algorithms such as the FFT. I was also able to get into its architectural details and build a proprietary threading model, called micro-threading. This framework hides memory latency in most workloads without exposing the developer to the lowest-level architectural details of the Cell processor. I also ran a series of experiments to characterize the effect of its architectural properties on memory latency patterns in different workloads.
The beauty of the CBE, from my point of view, is the great extent of control it gives you to reach the best execution time. Its other architectural properties, such as its compute capabilities, multiple levels of parallelism, and on-chip network, are discussed in many publications, and I experienced all these great features during my PhD journey. However, I would like to note the following from some of my experiences.
Although the Element Interconnect Bus (EIB) offers high performance and very low latency, the effect of this topology on per-core performance is overlooked by most research efforts. For example, from my memory latency measurement experiments I found that memory latency differs from one core to another depending on how far that core is from the memory controller. This does not reflect a flaw in the design, but it is a property that changes the memory latency measured at each core. As this topology is used more widely and more cores are connected to the same ring, the effect of physical location will only grow in importance. The question is: how does this affect my performance as long as I can use techniques such as multi-buffering and data prefetching? The answer is very simple: as long as memory latency differs from one core to another, you need to hide it according to each core's latency measurements. For example, on cores with relatively high memory latency you can use more buffers to prefetch your data compared to cores with lower memory latency. I already discussed this in my optimum micro-threading scheduling paper; please have a quick look at it to understand this issue better.
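To make the multi-buffering idea concrete, here is a minimal sketch of the general technique using plain Cell SDK DMA intrinsics (this is not my micro-threading framework, just the underlying idea). The names CHUNK, NUM_BUFFERS, and process_chunk() are hypothetical placeholders; the SPE loop prefetches the next chunk with an asynchronous DMA while it processes the current one.

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK        4096   /* bytes per DMA transfer (hypothetical size)        */
#define NUM_BUFFERS  2      /* raise this on SPEs that measure higher latency    */

/* Local-store buffers; ea passed to stream_in() should be 128-byte aligned too. */
static char buf[NUM_BUFFERS][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *data, unsigned int size);  /* your per-chunk work */

void stream_in(uint64_t ea, unsigned int nchunks)
{
    unsigned int i, b;

    /* Prime the pipeline: issue one DMA per buffer before touching any data. */
    for (b = 0; b < NUM_BUFFERS && b < nchunks; b++)
        mfc_get(buf[b], ea + (uint64_t)b * CHUNK, CHUNK, b, 0, 0);

    for (i = 0; i < nchunks; i++) {
        b = i % NUM_BUFFERS;

        /* Block only on the DMA tag that owns the buffer we need next. */
        mfc_write_tag_mask(1 << b);
        mfc_read_tag_status_all();

        process_chunk(buf[b], CHUNK);

        /* Refill this buffer with the chunk that is NUM_BUFFERS ahead. */
        if (i + NUM_BUFFERS < nchunks)
            mfc_get(buf[b], ea + (uint64_t)(i + NUM_BUFFERS) * CHUNK,
                    CHUNK, b, 0, 0);
    }
}

The only per-core tuning knob here is NUM_BUFFERS: an SPE that measures higher latency because it sits farther from the memory controller on the EIB simply needs more outstanding transfers in flight to cover the same amount of work.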
Also, the newer implementation of the CBE (the PowerXCell 8i) can now work with larger RAM; however, this RAM is based on DDR2 technology. The initial implementation using XDR RAM was limited to a maximum of 1 GB of memory. Although the new implementation has an enhanced double-precision floating-point unit and two memory controllers to keep processor-memory bandwidth high, memory latency is getting higher due to the higher latency of DDR2 RAM compared to the XDR implementation.


NVIDIA GeForce GPGPU
I also worked on the NVIDIA GeForce GTX 295, implementing some information retrieval algorithms. It is still work in progress and yet to be published. However, based on my experience with the Cell Broadband Engine, I noticed some architectural properties that are also worth sharing with you.
NVIDIA's GPU programming framework abstracts the architectural details of the microprocessor to a great extent. From a productivity point of view, this is a great feature. However, it leaves few options for researchers and experienced developers to explore the processor's internals. For example, I couldn't find a clear way to control how threads are scheduled onto the processor's cores. It seems to be done, though I'm not sure yet, by the hardware scheduler, following the same policy used by Intel's Hyper-Threading and Sun's multi-threading architectures. I'm now measuring whether memory latency differs from one core to another; my initial measurements show that it is almost the same across all cores. However, I'm a little concerned about the relatively many hierarchies built into the processor. I realize that the shared memory model mandates such hierarchy to keep synchronization overhead reasonable, but adding hierarchies to run away from this problem is not the best solution. NVIDIA is still investing in this hierarchical design through their new GPU processor, Fermi.
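As an illustration of the kind of measurement I mean, below is a minimal pointer-chasing latency sketch in CUDA C. The sizes N, STRIDE, and ITERS are hypothetical placeholders, and the absolute cycle counts will vary by card; note also that there is no CUDA call to pin the block to a particular multiprocessor, the hardware scheduler decides where it runs, which is exactly the limitation mentioned above.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N       (1 << 20)   /* elements in the chase array (hypothetical size)    */
#define STRIDE  64          /* stride in elements, large enough to defeat caching */
#define ITERS   1024        /* dependent loads to time                            */

/* A single thread chases a chain of dependent loads; the elapsed clocks divided
   by ITERS approximate the memory latency seen from this multiprocessor. */
__global__ void latency_kernel(const unsigned int *chain,
                               unsigned int *out, unsigned int *cycles)
{
    unsigned int idx = 0;
    unsigned int start = (unsigned int)clock();
    for (int i = 0; i < ITERS; i++)
        idx = chain[idx];                 /* each load depends on the previous one */
    unsigned int stop = (unsigned int)clock();

    *out = idx;                           /* keep the loads from being optimized away */
    *cycles = stop - start;
}

int main(void)
{
    unsigned int *h_chain = (unsigned int *)malloc(N * sizeof(unsigned int));
    for (unsigned int i = 0; i < N; i++)
        h_chain[i] = (i + STRIDE) % N;    /* simple strided pointer chain */

    unsigned int *d_chain, *d_out, *d_cycles;
    cudaMalloc(&d_chain, N * sizeof(unsigned int));
    cudaMalloc(&d_out, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(unsigned int));
    cudaMemcpy(d_chain, h_chain, N * sizeof(unsigned int), cudaMemcpyHostToDevice);

    /* One block, one thread: which core it lands on is up to the scheduler. */
    latency_kernel<<<1, 1>>>(d_chain, d_out, d_cycles);

    unsigned int cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("~%u cycles per dependent load\n", cycles / ITERS);

    cudaFree(d_chain); cudaFree(d_out); cudaFree(d_cycles);
    free(h_chain);
    return 0;
}

Repeating the run many times and comparing the per-run cycle counts gives a rough picture of whether different multiprocessors see noticeably different latencies, which so far they do not seem to.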


Although multi-core processors provide a smart escape from the physical limitations of single-core processors, they need thorough architectural analysis to best utilize their resources. I think this is attainable by monitoring different execution patterns and pinpointing bottlenecks. This should make it easier to build efficient programming models, run-time systems, and algorithms for multi- and many-core microprocessors. For example, inside the Cell Broadband Engine, manual cache management and the Element Interconnect Bus (EIB) created the opportunity to build run-time systems that reach the best performance while simplifying the programming model, such as micro-threading, the MPI microtask model, and data prefetching. I think multi- and many-core processors will evolve through a closed feedback loop, see below.






Whenever a new architecture is introduced, developers and researchers start implementing different algorithms and applications to get the best out of it. However, performance bottlenecks pop to the surface very soon. Several efforts try to solve these bottlenecks either by tweaking the implemented algorithms or by building general frameworks that identify performance-degrading parameters at run time and change them. On the other hand, the loop is properly closed only if microprocessor designers listen to developers' notes and try to hide these bottlenecks through architectural enhancements. This, I think, was handled properly in NVIDIA's new Fermi GPGPU architecture. For example, Fermi now has multiple memory controllers, each handling requests from a different group. This should reduce the effect of memory request serialization, which is a serious performance bottleneck.

I believe research teams are now moving away from the naive ways of speeding up multi- and many-core processors through the old tricks of algorithmic enhancements, and toward digging deeper into the processor's architectural features and proposing better programming models backed by run-time libraries and frameworks. This trend is blending compilers, operating systems, parallel programming, and microprocessor architecture together.