Friday, November 13, 2009

Nano-kernels for the Era of Exascale Computing

I'm talking here again about the multi-core processors for massively parallel systems working on complex scientific applications. However, I'm tackling this area from a different perspective. I would like to think with you of how multi-core processors will look like in five years from now and think of these questions: What are the serious problems that these processors will suffer from (from system's perspective)? Which of the current solutions or anticipated frameworks may help us solving these problems? I’ll be discussing only one of them here. It is very difficult to predict accuratly what technology advancements will take place in the comming five years. However, there are general trends that we can track and resonably predict their future effects.

Anyways, I mentioned before that multi-core processors will be of thousands of cores maybe even tens of thousands of cores (check the latest AMD GPGPU Radeon HD 5970). These cores will be very simple and solving the same problem but on different data chunks. This of course mandates the existance of shared resources and shared areas contain input data and be able to store results. The path from one core to the data storage and other shared resources will get more complex and involving more shared resources, such as more hirarchies of on-chip and off-chip caches, cores interconnections, I/O buses, etc. The anticipated path and hierarchies through which data will be traveling to reach sysem's main memory or to rech core's registers will have a very important effect on data movement latency. It is not about only having higer latency which can be hidden by many veified techniques. It is about the variance of this latency from one core to another and from one request to anothr inside the same core. Current software solutions, such as prefetching in multiple buffers depend on the fact that latency to move data from memory to all processor's cores will be the same at run-time. However, this is not true even in current multi-core processors. For example, inside the Cell Broadband Engine, the DMA latency (or memory latency) differs from one core to another depending on its physical location inside the chip and how far it is from memory controller. This variance will be even bigger as these processors grow in number of cores and as contension increases on shared resources inside them. Such variance requires solutions that would hide memory latency dynamically at run-time inside each core based on each specific core's data latency. Current software solutions, such as prefetching and multi-buffering are depending on constant memory latency across all cores.
Some hardware-based solutions tried to solve this problem through hyper- or multi-threading. Inside multi-core processors with multi-threaded feature, once a thread is blocked for an I/O or data movement, another thread gets active and resumes execution. Sun Microsystems through its latest UltraSparc T2 & T2 Plus, added up to four threads per core, which gives at the end large number of virtually concurrent threads on the same chip. However, there are two important drawbacks. First, if memory latency is pretty low for any reason these threads will be spending most of their time switching, which would give at the end a semi serial performance because four threads are sharing the same ALU and FP units. On the other hand, if memory latency is really high for all of the working threads inside the same core, we may end up with idle time because all of them will be waiting for data to come from system's memory or an I/O device. Second, it this solution adds complexity to the hardware and consume space that could be used for bigger cache or even more single threaded cores.



Nano-Kernels
Ok, what would be the solution then? If we could dynamically create a threading framework that can create and manage threads pretty much to hyper- or multi-threaded architectures, we may be able to solve data latency problem smartly and for massively parallel multi-core processors. As long as each core will have its own data latency, why don’t we create small software threads that would switch their context to core's local cache instead of switching it to the system's main memory or second level cache. The context in this case will be the core's registers (pretty much similar to current hardware based multi-threaded architectures) and few control registers affecting the execution of the thread, such as the program counter. So whenever one of these small threads, let's call them micro-threads, stalls for a data chunk to be copied from system's main memory, it will go to sleep mode and another micro-thread is switched to running mode and resume execution. A very small and very fast kernel, we may call it a nano-kernel, should actively run inside these cores to schedule micro-threads and make sure that data movement latency is hidden almost completely inside each core. This idea of having micro-threads has two advantages. First, the number of micro-threads is dynamic, which means number of micro-threads depends on data movement latency. For example, in large data movement latency we may add more micro-threads per core to work longer during other micro-threads wait time for data to be ready in core's cache. Second, context switching inside each core's cache makes it very cheap and very fast process, i.e. few of nano seconds. Of course, context saving will consume from each core's cache but this is already consumed by several magnitudes to implement the hardware based multi-threaded architectures. Also, this will require specific faciltities provided by the ISA. For example, manual cache management and internal interrupting facility inside each core are mandatory for this idea to work.
So, if we create nano-kernels doing optimization inside each core, we would reach new performance ceilings. It is scalable since each core has its own nano-kernel working independently and scheduling micro-threads based on resources given to the core. So, with thens of throusand of threads this solution would still work and get the most out of the expected massively parallel multi-core processors.