Monday, December 21, 2009

A New Platform

I moved my blog to a different platform.

Follow me: http://MohamedFAhmed.wordpress.com

It already has several new blog posts.

I'm no longer posting to this blog.

See you there :)

Saturday, December 5, 2009

Which multi-core processor will get the job done?

Multi-core processors are supposed to give us a better performance because simply they are out there for that sole purpose. But the question is: Which one would be the best to get the job done?

Well, the definite answer for this question is: it depends!

From my experiences with different architectures the best payback is from what you really want from them. I’m not talking here about the end user experiences I’m trying to handle this question from the developer’s perspective, mainly for applications that are compute intensive, such as scientific applications, data intensive applications, and discrete algorithms with considerably complex computations.

So, if the answer is: it depends, I think it depends on:
  • Which problems your machine will be working on?
  • How fast you want the code to be written 
  • How much you would like to pay for the hardware/software
Of course life is much more complex to only consider what I write here as the only guidelines to make up your mind. I’m only pinpointing important factors for your decision. Also, please feel free to ask me questions end of this post if you feel that I’m missing other important factors. I will also write in my upcoming posts more related details that would help you. I’m assuming that you are new to multi-core programming and would like to get started and tap new domains of programming.

Which problems your machine will be working on?
If you are intending to solve problems that you need to do in parallel for huge data sets, I would recommend using simple multi-core processors such as the GPGPUs or the Cell Broadband Engine. For example, one core inside the Cell Broadband Engine, Synergistic Processing Element (SPE), can give you around 25.5 GFLOPS; meanwhile one Intel Xeon core can give you around 9.6 GFLOPs. Inside the Cell processor you have 8 cores totaling around 205 GLFLOPs available for you. Also if you consider the GPGPUs you can get up to 1700 GFLOPS per one GPU card compared to a total of 96 GFLOPS for the latest Quad-2-Core general purpose processor from Intel. This category of applications or algorithms includes, but not limited to: string matching and searching, sorting large data sets, FFT computations, data visualization, network traffic scanning, and artificial intelligence.

However, if you are developing applications that can work independently and perform a lot of I/O, you would select the general purpose multi-core processors such as the Power7, Intel’s Quad-Core, AMD Phenom I & II, or Sun’s Spark T1 & T2 processors. These processors are providing in some cases up to 32 concurrently running threads, combining both multi-core and multi-threaded architectures. For example, building a web application and distributing the workload over three tiers: web server, business logic, and the database layers, would perfrom better on these architectures. Each one of these layers is I/O oriented; in addition, each one works as independent process. Each core in such case can handle a layer efficiently without affecting the execution of other cores. In addition, synchronization among them is handled by the operating system through the inter-process communication APIs provided by the OS.

How fast you want to start coding?
When it comes to ease or difficulty of multi-core programming, there are three levels of difficulty. Each level is trading off the difficulty with performance. I’ll start with the easiest and most common one.
Conventional general purpose multi-core processors are the easiest to program and get running parallel applications on top of them. You can still use the old programming models using PThreads or OpenMP to program multi-threaded applications. You don’t have to study the underlying hardware. You will have to only refresh your old parallel programming knowledge. In addition, if you are developing coarse grained parallel applications. You can run it as independent processes and use the old synchronization APIs provided by the operating system. What this programming model is trading off are: (1) Limited scalability: you cannot create and easily manage a lot of threads (I’m talking about 100s of threads) because of the hardware and programming model limitations, (2) You don’t have a lot of space to maneuver and do architecture specific optimization; the cache and threads scheduling is done on behalf of you, which may not provide the best performance for your application. So you can spend few days reviewing your old knowledge and you will be ready to produce parallel applications out of these models.

The next level is the new programming models recently built on top of the GPGPUs such as CUDA and OpenCL. These models are built to combine both conventional serial programming models with the new kernel based parallel programming paradigm. These would allow you to write the program and input initialization code in a serial fashion as it used to be; and then offload the compute intensive part to the GPU card. The kernel function gets executed by many threads; each thread is running in its own context. You should use the synchronization primitives provided by the framework for these offload threads. Of course, these platforms are for applications with fine grained parallelism. The main advantages that this model may give you: (1) Automatic management of many threads, 100’s of threads; the framework will create, manage and destruct the threads contexts transparently; (2) Hiding some of the architectural complexities; for example, you don’t have to manage the cache or pipelines optimizations. The tradeoffs of these models are: (1) Ease of programming; it is now more difficult to program with these models since your application or algorithm must be aware of the architectural heterogeneity. Also you will have to understand the some of the architectural aspects of the GPU card, such as the memory hierarchy; however, these aspects are not as complex as in the third model, (2) some performance is lost due to programming model abstractions. For example, you can create threads more than the number of available cores; this may delay the overall execution if these threads are synchronizing through single or few shared variables. It is recommended to use this model if you have highly parallel problems and don’t want to focus on most of the architectural aspects. You may need a week or two to understand the model and architecture before you start programming.

The third level is challenging but provides many distinction points for you as a developer or researcher who selects this programming model. I see only the Cell Broadband Engine as the only processor in this category. You can still program your parallel application using PThreads or OpenMP. However, this is only to create and kill the threads. To synchronize and properly implement your algorithm you will have to understand all the architectural aspects of the microprocessor, such as the DMA for cache management inside the Cell processor. The main advantage that this model provides is the great flexibility in utilizing the available resources to get best performance. It worth mentioning here that one Cell processor with only 9 cores provides around 50% of the peak performance of the NVIDIA GTS 285 equipped with 120 cores. The major tradeoff in this model is the difficulty of programming. You may need few weeks to study the architecture and the programming model before you can start writing your code. Also you will have to spend even more time to optimize your application and get the best possible performance.


How much you would like to pay for the hardware and software?
You cannot isolate the price from the performance you can get out of the your processor. We can simply use the simple ration of dollars for each GFLOP. Table Below can tell you more about this.

Architecture
Cost
Theoretical Peak GFLOPs
Cost of 1 GFLOP
Intel Quad
Core Q9650

$379
48
$7.9
AMD Phenom
II X4 965

$200
45
$4.4
Power7
Not Yet Released
NVidia
GeForce GTX 295

$480
1788
$0.27
ATI Radeon
HD 5870

$420
2160
$0.19
Cell
Processor (Inside PS3)

$350
160
$5.83
Cell
Processor (One Blade)

$10,000
204
$49

Don't forget that the GPGPUs should be hosted by a full machine, which adds to the total cost per one GFLOP.

Can I use combinations of different architectures?
Yes, and you should be creative about that. For example, you can build your web application on top of a general purpose multi-core processor and use a specialized simple multi-core architecture to do parts of the workload. You can attach a GPU card to your machine and use it to search in large datasets or you can use it to do sorting and filtering of large search results. In this case you are combining heterogeneous architectures to get the best out of each. You can do this at the network level. You can build a hybrid cluster to make each node handle different parts of the workload. For example, nodes that do I/O intensive should have powerful general purpose multi-core processors utilizing multi-threaded architectures. These processors are very good in scheduling their threads and processes for I/O intensive applications. And you can allocate nodes with excellent processing power to do data filtering, sorting, or any other compute intensive tasks of your application.




Thursday, November 26, 2009

End of the Cell Broadband Engine?!

Yesterday the InternetNews.com released a piece of news about the end of the Cell Broadband Engine. David Turek the VP of deep computing at IBM said during an interview with the German site Heise Online that the power XCell-8i will be the last of the Cell line. IBM will be focusing on power7 processor, which is due mid 2010.

In this news article, it is mentioned, by Jon Peddie, that Cell processor had many shortcomings that became apparent, such as lack of direct access to the global memory by its computing engines (the SPEs) and wrongly mentioned that everything has to go through its powerPC core, which creates a bottleneck. This is technically not true. The PowerPC core is not handling any of the requests initiated by the other compute engines (the SPEs). Also, it might be noted by some researchers that its cache should be bigger, but its performance still noted by many to be the best compared to other multi-core processors fall within its category. In addition, the Cell processor taught a lot of developers and researchers the best parallel programming practices for multi-core processors. The fact that everything is controlled by the developer forced all its programmers to think better of the best ways to optimize their algorithm's execution time.

Although I may somehow believe that IBM may do changes to the Cell processor. It is very difficult to believe that IBM will end its Cell processor line that soon. IBM invested a lot of money and time and also many of its customers invested tons of money adopting the Cell processor.

I think IBM is trying to produce its own line away from Sony and Toshiba without giving away $500 million worth of investment and five years of engineering. It is about business. The cell processor is one of the master pieces in the multi-core processors. And as mentioned by David Turk the future is for hybrid multi-core processors, for a very simple reason: they provide great ratio of processing speed to consumed power.

I think IBM will reuse the SPEs instruction set along with their traditional PowerPC architecture but the change might be in how the cache will be organized and managed. Also, I think IBM is rethinking the cores interconnect network. They may use either dynamic networks or a mix between on-chip-network and shared cache architecture.