Saturday, December 5, 2009

Which multi-core processor will get the job done?

Multi-core processors are supposed to give us better performance; after all, that is the sole reason they exist. But the question is: which one would be the best to get the job done?

Well, the definitive answer to this question is: it depends!

From my experience with different architectures, the best payback comes from matching the processor to what you really want from it. I'm not talking here about the end-user experience; I'm trying to handle this question from the developer's perspective, mainly for applications that are compute intensive, such as scientific applications, data-intensive applications, and discrete algorithms with considerably complex computations.

So, if the answer is: it depends, I think it depends on:
  • Which problems will your machine be working on?
  • How fast do you want to start coding?
  • How much would you like to pay for the hardware/software?
Of course, life is much more complex than what I write here, so don't treat these as the only guidelines for making up your mind; I'm only pinpointing important factors for your decision. Also, please feel free to ask me questions at the end of this post if you feel that I'm missing other important factors. I will also write more related details in my upcoming posts that should help you. I'm assuming that you are new to multi-core programming and would like to get started and tap new domains of programming.

Which problems will your machine be working on?
If you intend to solve problems in parallel over huge data sets, I would recommend simple multi-core processors such as the GPGPUs or the Cell Broadband Engine. For example, one core inside the Cell Broadband Engine, the Synergistic Processing Element (SPE), can give you around 25.5 GFLOPS, while one Intel Xeon core gives you around 9.6 GFLOPS. Inside the Cell processor you have 8 such cores, totaling around 204 GFLOPS available for you. If you consider the GPGPUs, you can get up to 1700 GFLOPS per GPU card, compared to a total of 96 GFLOPS for the latest quad-core general-purpose processor from Intel. This category of applications and algorithms includes, but is not limited to: string matching and searching, sorting large data sets, FFT computations, data visualization, network traffic scanning, and artificial intelligence.

However, if you are developing applications that can work independently and perform a lot of I/O, you would select general-purpose multi-core processors such as the Power7, Intel's quad-core processors, AMD's Phenom I & II, or Sun's SPARC T1 & T2. These processors provide in some cases up to 32 concurrently running threads, combining both multi-core and multi-threaded architectures. For example, a web application distributing its workload over three tiers (web server, business logic, and database) would perform better on these architectures. Each of these layers is I/O oriented, and each works as an independent process. A core in this case can handle a layer efficiently without affecting the execution of other cores. In addition, synchronization among the layers is handled by the operating system through the inter-process communication APIs it provides.
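
To make this concrete, here is a minimal sketch of that style: two independent processes cooperating through one of those OS-provided IPC APIs, a POSIX pipe. The tier names are purely illustrative:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        int fd[2];
        pipe(fd);                       /* OS-provided IPC channel */

        if (fork() == 0) {              /* child: plays the "worker tier" */
            close(fd[0]);
            const char *msg = "result from worker tier";
            write(fd[1], msg, strlen(msg) + 1);
            close(fd[1]);
            return 0;
        }

        /* parent: plays the "front tier"; blocks until the worker responds */
        close(fd[1]);
        char reply[64];
        read(fd[0], reply, sizeof(reply));
        printf("front tier received: %s\n", reply);
        close(fd[0]);
        wait(NULL);                     /* reap the child process */
        return 0;
    }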

How fast do you want to start coding?
When it comes to the ease or difficulty of multi-core programming, there are three levels of difficulty. Each level trades difficulty off against performance. I'll start with the easiest and most common one.
Conventional general-purpose multi-core processors are the easiest to program and to get parallel applications running on. You can still use the old programming models, PThreads or OpenMP, to write multi-threaded applications. You don't have to study the underlying hardware; you only have to refresh your old parallel programming knowledge. In addition, if you are developing coarse-grained parallel applications, you can run them as independent processes and use the old synchronization APIs provided by the operating system. What this programming model trades off: (1) Limited scalability: you cannot create and easily manage a lot of threads (I'm talking about hundreds of threads) because of hardware and programming model limitations. (2) You don't have much room to maneuver and do architecture-specific optimization; cache and thread scheduling are done on your behalf, which may not give the best performance for your application. So you can spend a few days reviewing your old knowledge and you will be ready to produce parallel applications with these models.
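
For example, this is roughly all it takes in this model: a dot product parallelized with a single OpenMP pragma (compile with gcc -fopenmp), with the runtime deciding how to split the work across cores:

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N];      /* static: too big for the stack */
        double dot = 0.0;

        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* One pragma is the whole parallelization story: the OpenMP runtime
           splits the iterations across cores and combines the partial sums. */
        #pragma omp parallel for reduction(+:dot)
        for (int i = 0; i < N; i++)
            dot += a[i] * b[i];

        printf("dot = %.1f using up to %d threads\n", dot, omp_get_max_threads());
        return 0;
    }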

The next level is the new programming models recently built on top of the GPGPUs, such as CUDA and OpenCL. These models combine the conventional serial programming model with a new kernel-based parallel programming paradigm. They allow you to write the program and input-initialization code in a serial fashion, as you are used to, and then offload the compute-intensive part to the GPU card. The kernel function gets executed by many threads, each running in its own context, and you should use the synchronization primitives provided by the framework for these offloaded threads. Of course, these platforms are for applications with fine-grained parallelism. The main advantages this model gives you: (1) Automatic management of many threads, hundreds of them; the framework creates, manages, and destroys the thread contexts transparently. (2) Hiding some of the architectural complexities; for example, you don't have to manage the cache or pipeline optimizations. The tradeoffs: (1) It is more difficult to program with these models, since your application or algorithm must be aware of the architectural heterogeneity, and you will have to understand some architectural aspects of the GPU card, such as the memory hierarchy; however, these aspects are not as complex as in the third model. (2) Some performance is lost to the programming model's abstractions. For example, you can create more threads than there are available cores, which may delay the overall execution if these threads synchronize through one or a few shared variables. I recommend this model if you have highly parallel problems and don't want to focus on most of the architectural aspects. You may need a week or two to understand the model and the architecture before you start programming.
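
To show what this kernel-plus-offload structure looks like, here is a minimal sketch using the OpenCL C API; error checking and cleanup are omitted for brevity, and the kernel simply scales an array, one GPU thread per element:

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* The kernel: the part that is offloaded and executed by many
       GPU threads, one per data element. */
    static const char *src =
        "__kernel void scale(__global float *a, float f, int n) {\n"
        "    int i = get_global_id(0);\n"
        "    if (i < n) a[i] *= f;\n"
        "}\n";

    int main(void) {
        enum { N = 1048576 };
        float *data = malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) data[i] = (float)i;

        /* Serial host-side setup, as described above. */
        cl_platform_id plat;  cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    N * sizeof(float), data, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        float factor = 2.0f;  int n = N;
        clSetKernelArg(k, 0, sizeof(buf), &buf);
        clSetKernelArg(k, 1, sizeof(factor), &factor);
        clSetKernelArg(k, 2, sizeof(n), &n);

        /* Offload: launch one thread per element, then copy results back. */
        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, N * sizeof(float), data,
                            0, NULL, NULL);

        printf("data[3] = %.1f (expected 6.0)\n", data[3]);
        free(data);
        return 0;
    }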

The third level is challenging but provides many points of distinction for you as the developer or researcher who selects it. I see the Cell Broadband Engine as the only processor in this category. You can still use PThreads or OpenMP, but only to create and kill threads; to synchronize and properly implement your algorithm, you will have to understand all the architectural aspects of the microprocessor, such as using DMA for cache management inside the Cell processor. The main advantage of this model is the great flexibility in utilizing the available resources to get the best performance. It is worth mentioning here that one Cell processor with only 9 cores provides around 50% of the peak performance of the NVIDIA GTS 285 equipped with 120 cores. The major tradeoff is the difficulty of programming. You may need a few weeks to study the architecture and the programming model before you can start writing your code, and you will have to spend even more time optimizing your application to get the best possible performance.
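
To give a feel for the difference, here is a minimal SPE-side sketch assuming the Cell SDK's SPU intrinsics (spu_mfcio.h): the code explicitly DMAs a chunk of main memory into the SPE's local store before computing on it, which is exactly the manual cache management mentioned above:

    /* SPE-side code, compiled with spu-gcc; a hedged sketch assuming the
       Cell SDK's SPU intrinsics from spu_mfcio.h. */
    #include <spu_mfcio.h>

    #define CHUNK 4096  /* bytes per DMA transfer; one DMA maxes out at 16 KB */

    /* DMA targets in the 256 KB local store must be 16-byte aligned
       (128-byte alignment gives the best performance). */
    static volatile float buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        unsigned int tag = 1;
        float sum = 0.0f;
        (void)speid; (void)envp;

        /* Explicitly pull a chunk from main memory (effective address passed
           in argp by the PPE side) into local store. */
        mfc_get(buf, argp, CHUNK, tag, 0, 0);

        /* Block until the tagged transfer completes. */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        for (unsigned int i = 0; i < CHUNK / sizeof(float); i++)
            sum += buf[i];

        return (int)sum;   /* toy result; real code would DMA results back */
    }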


How much would you like to pay for the hardware and software?
You cannot isolate the price from the performance you can get out of your processor. We can simply use the ratio of dollars per GFLOP. The table below tells you more about this.

Architecture                  | Cost             | Theoretical Peak GFLOPS | Cost of 1 GFLOP
Intel Core 2 Quad Q9650       | $379             | 48                      | $7.9
AMD Phenom II X4 965          | $200             | 45                      | $4.4
Power7                        | Not yet released |                         |
NVIDIA GeForce GTX 295        | $480             | 1788                    | $0.27
ATI Radeon HD 5870            | $420             | 2160                    | $0.19
Cell processor (inside PS3)   | $350             | 160                     | $5.83
Cell processor (one blade)    | $10,000          | 204                     | $49

Don't forget that a GPGPU must be hosted by a full machine, which adds to its total cost per GFLOP.

Can I use combinations of different architectures?
Yes, and you should be creative about that. For example, you can build your web application on top of a general-purpose multi-core processor and use a specialized, simple multi-core architecture for parts of the workload. You can attach a GPU card to your machine and use it to search large datasets, or to sort and filter large search results. In this case you are combining heterogeneous architectures to get the best out of each. You can also do this at the network level by building a hybrid cluster in which each node handles a different part of the workload. For example, nodes that do I/O-intensive work should have powerful general-purpose multi-core processors with multi-threaded architectures; these processors are very good at scheduling their threads and processes for I/O-intensive applications. You can then allocate nodes with excellent processing power to do data filtering, sorting, or any other compute-intensive tasks of your application.




Thursday, November 26, 2009

End of the Cell Broadband Engine?!

Yesterday InternetNews.com released a piece of news about the end of the Cell Broadband Engine. David Turek, the VP of deep computing at IBM, said during an interview with the German site Heise Online that the PowerXCell 8i will be the last of the Cell line, and that IBM will be focusing on the Power7 processor, which is due mid-2010.

In this news article, Jon Peddie mentions that the Cell processor had many shortcomings that became apparent, such as the lack of direct access to global memory by its computing engines (the SPEs), and wrongly states that everything has to go through its PowerPC core, creating a bottleneck. This is technically not true: the PowerPC core does not handle any of the requests initiated by the other compute engines (the SPEs). Some researchers also note that its cache should be bigger, but its performance is still considered by many to be the best among multi-core processors in its category. In addition, the Cell processor taught a lot of developers and researchers the best parallel programming practices for multi-core processors. The fact that everything is controlled by the developer forced all its programmers to think harder about the best ways to optimize their algorithm's execution time.

Although I can believe that IBM may make changes to the Cell processor, it is very difficult to believe that IBM will end its Cell processor line that soon. IBM invested a lot of money and time in it, and many of its customers invested tons of money adopting the Cell processor.

I think IBM is trying to produce its own line away from Sony and Toshiba without giving away $500 million worth of investment and five years of engineering. It is about business. The Cell processor is one of the masterpieces of multi-core design. And as David Turek mentioned, the future is for hybrid multi-core processors, for a very simple reason: they provide a great ratio of processing speed to power consumed.

I think IBM will reuse the SPEs' instruction set along with its traditional PowerPC architecture, but the change might be in how the cache is organized and managed. I also think IBM is rethinking the cores' interconnection network; they may use either dynamic networks or a mix of an on-chip network and a shared-cache architecture.

Friday, November 13, 2009

Nano-kernels for the Era of Exascale Computing

I'm talking here again about multi-core processors for massively parallel systems working on complex scientific applications, but tackling the area from a different perspective. I would like to think with you about how multi-core processors will look five years from now, and about these questions: What serious problems will these processors suffer from (from a system's perspective)? Which current solutions or anticipated frameworks may help us solve these problems? I'll be discussing only one of them here. It is very difficult to predict accurately what technology advancements will take place in the coming five years. However, there are general trends that we can track and whose future effects we can reasonably predict.

Anyway, I mentioned before that multi-core processors will have thousands of cores, maybe even tens of thousands (check the latest AMD GPGPU, the Radeon HD 5970). These cores will be very simple, all solving the same problem but on different data chunks. This of course mandates the existence of shared resources and shared areas that hold input data and store results. The path from a core to data storage and other shared resources will get more complex and will involve more shared resources, such as deeper hierarchies of on-chip and off-chip caches, core interconnections, I/O buses, etc. The anticipated path and hierarchies through which data travels to reach the system's main memory, or to reach a core's registers, will have a very important effect on data movement latency. The problem is not just higher latency, which can be hidden by many verified techniques; it is the variance of this latency from one core to another, and from one request to another inside the same core. Current software solutions, such as prefetching into multiple buffers, depend on the assumption that the latency of moving data from memory is the same for all of a processor's cores at run-time. However, this is not true even in current multi-core processors. For example, inside the Cell Broadband Engine, the DMA (memory) latency differs from one core to another depending on its physical location inside the chip and how far it is from the memory controller. This variance will grow as these processors gain more cores and as contention increases on the shared resources inside them. It requires solutions that hide memory latency dynamically at run-time inside each core, based on that core's specific data latency.
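
For reference, this is the kind of multi-buffering I mean: a minimal double-buffering sketch in plain C, where fetch_chunk() is a synchronous stand-in for whatever asynchronous DMA or prefetch primitive the platform provides. The overlap only pays off if the fetch latency is uniform and predictable, which is exactly the assumption that is breaking down:

    #include <string.h>
    #include <stdio.h>

    #define CHUNK 1024
    #define NCHUNKS 8

    /* Stand-in for an asynchronous DMA/prefetch; on a real platform this
       call would return immediately and completion would be waited on later. */
    static void fetch_chunk(float *dst, const float *src, int n) {
        memcpy(dst, src, n * sizeof(float));
    }

    static float process_chunk(const float *c, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) s += c[i];
        return s;
    }

    int main(void) {
        static float memory[NCHUNKS * CHUNK];   /* stands in for main memory */
        static float buf[2][CHUNK];             /* the two local buffers */
        float total = 0.0f;

        fetch_chunk(buf[0], &memory[0], CHUNK); /* prime the first buffer */
        for (int i = 0; i < NCHUNKS; i++) {
            int cur = i & 1;
            /* Start filling the other buffer while this one is processed;
               the benefit assumes the fetch time matches the compute time. */
            if (i + 1 < NCHUNKS)
                fetch_chunk(buf[!cur], &memory[(i + 1) * CHUNK], CHUNK);
            total += process_chunk(buf[cur], CHUNK);
        }
        printf("total = %f\n", total);
        return 0;
    }
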
Some hardware-based solutions try to solve this problem through hyper- or multi-threading. Inside multi-core processors with multi-threading, once a thread blocks on I/O or data movement, another thread becomes active and resumes execution. Sun Microsystems, in its latest UltraSPARC T2 & T2 Plus, added up to eight threads per core, which in the end gives a large number of virtually concurrent threads on the same chip. However, there are two important drawbacks. First, if memory latency happens to be low, these threads will spend most of their time switching, which in the end gives semi-serial performance because the threads share the same ALU and floating-point units. On the other hand, if memory latency is really high for all of the working threads inside the same core, we may end up with idle time because all of them will be waiting for data to come from the system's memory or an I/O device. Second, this solution adds complexity to the hardware and consumes space that could be used for a bigger cache or even more single-threaded cores.



Nano-Kernels
OK, what would the solution be then? If we could build a threading framework that creates and manages threads much as hyper- or multi-threaded architectures do, we might be able to solve the data latency problem smartly, and for massively parallel multi-core processors. Since each core has its own data latency, why don't we create small software threads that switch their context into the core's local cache instead of out to the system's main memory or second-level cache? The context in this case would be the core's registers (much as in current hardware-based multi-threaded architectures) plus the few control registers affecting the thread's execution, such as the program counter. Whenever one of these small threads, let's call them micro-threads, stalls waiting for a data chunk to be copied from the system's main memory, it goes into sleep mode and another micro-thread is switched to running mode and resumes execution. A very small and very fast kernel, we may call it a nano-kernel, would run actively inside each core to schedule micro-threads and make sure that data movement latency is hidden almost completely inside each core. This idea of micro-threads has two advantages. First, the number of micro-threads is dynamic: it depends on the data movement latency. For example, under high data movement latency we may add more micro-threads per core, so there is more work to do while other micro-threads wait for their data to arrive in the core's cache. Second, switching contexts inside each core's cache makes the switch a very cheap and very fast process, i.e., a few nanoseconds. Of course, saving contexts consumes part of each core's cache, but hardware-based multi-threaded architectures already spend several times that much on-chip space to implement the same idea in silicon. Also, this will require specific facilities from the ISA; for example, manual cache management and an internal interrupting facility inside each core are mandatory for this idea to work.
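
To make the idea more tangible, here is a toy sketch of such a scheduler written with the POSIX ucontext API on an ordinary machine. The micro-thread structure, the nano-kernel loop, and the "waiting for data" flag are all illustrative stand-ins; a real nano-kernel would keep contexts in the core's local cache and poll DMA tags instead of pretending transfers complete:

    #include <ucontext.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define STACK_SIZE 16384

    /* Hypothetical micro-thread: a saved context, its stack, and flags. */
    typedef struct {
        ucontext_t ctx;
        char stack[STACK_SIZE];
        int waiting;   /* set while its data transfer is "in flight" */
        int done;
    } microthread;

    static microthread mt[NTHREADS];
    static ucontext_t sched_ctx;
    static int current;

    /* A micro-thread calls this instead of blocking on a data transfer. */
    static void mt_wait_for_data(void) {
        mt[current].waiting = 1;               /* would really poll a DMA tag */
        swapcontext(&mt[current].ctx, &sched_ctx);
    }

    static void worker(void) {
        for (int step = 0; step < 3; step++) {
            /* ... issue an asynchronous transfer for the next chunk ... */
            mt_wait_for_data();                /* yield while data is in flight */
            printf("micro-thread %d: computing on chunk %d\n", current, step);
        }
        mt[current].done = 1;
    }

    int main(void) {
        for (int i = 0; i < NTHREADS; i++) {
            getcontext(&mt[i].ctx);
            mt[i].ctx.uc_stack.ss_sp = mt[i].stack;
            mt[i].ctx.uc_stack.ss_size = STACK_SIZE;
            mt[i].ctx.uc_link = &sched_ctx;    /* return here when finished */
            makecontext(&mt[i].ctx, worker, 0);
        }
        /* Nano-kernel main loop: round-robin over unfinished micro-threads. */
        int live = NTHREADS;
        while (live > 0) {
            for (current = 0; current < NTHREADS; current++) {
                if (mt[current].done) continue;
                mt[current].waiting = 0;       /* pretend its DMA completed */
                swapcontext(&sched_ctx, &mt[current].ctx);
                if (mt[current].done) live--;
            }
        }
        return 0;
    }
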
So, if we create nano-kernels doing this optimization inside each core, we would reach new performance ceilings. The approach is scalable, since each core has its own nano-kernel working independently and scheduling micro-threads based on the resources given to that core. Even with tens of thousands of threads, this solution would still work and get the most out of the expected massively parallel multi-core processors.