Well, the definitive answer to this question is: it depends!
In my experience with different architectures, the best payback comes from matching the architecture to what you really want from it. I’m not talking here about the end-user experience; I’m approaching this question from the developer’s perspective, mainly for compute-intensive applications such as scientific applications, data-intensive applications, and discrete algorithms with considerably complex computations.
So, if the answer is "it depends", I think it depends on:
- Which problems will your machine be working on?
- How fast do you want to start coding?
- How much would you like to pay for the hardware/software?
Which problems will your machine be working on?
If you intend to solve problems that must be processed in parallel over huge data sets, I would recommend processors built from many simple cores, such as GPGPUs or the Cell Broadband Engine. For example, one core inside the Cell Broadband Engine, a Synergistic Processing Element (SPE), can give you around 25.5 GFLOPS, while one Intel Xeon core can give you around 9.6 GFLOPS. Inside the Cell processor you have 8 such cores, totaling around 205 GFLOPS. If you consider GPGPUs, you can get up to 1700 GFLOPS from a single GPU card, compared to a total of 96 GFLOPS for the latest Core 2 Quad general-purpose processor from Intel. This category of applications and algorithms includes, but is not limited to: string matching and searching, sorting large data sets, FFT computations, data visualization, network traffic scanning, and artificial intelligence.
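The peak figures above are simple products of core count and per-core throughput. A minimal sketch of that arithmetic, using only the numbers quoted in this post (the function name `peak_gflops` is just an illustrative choice, and theoretical peak assumes every core runs at full rate at the same time, which real code rarely achieves):

```python
def peak_gflops(cores: int, gflops_per_core: float) -> float:
    """Theoretical peak: all cores running at their peak rate at once."""
    return cores * gflops_per_core

# Figures quoted in the text above (approximate, vendor-style peaks).
cell_peak = peak_gflops(cores=8, gflops_per_core=25.5)  # 8 SPEs in one Cell
gpu_peak = 1700.0  # one GPU card, as quoted
cpu_peak = 96.0    # quad-core general-purpose CPU, as quoted

print(f"Cell (8 SPEs): {cell_peak:.0f} GFLOPS")
print(f"GPU card vs. quad-core CPU: {gpu_peak / cpu_peak:.1f}x")
```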
However, if you are developing applications that can work independently and perform a lot of I/O, you should select general-purpose multi-core processors such as the Power7, Intel’s Quad-Core, AMD Phenom I & II, or Sun’s UltraSPARC T1 & T2 processors. These processors provide, in some cases, up to 32 concurrently running threads by combining multi-core and multi-threaded architectures. For example, a web application that distributes its workload over three tiers (web server, business logic, and database) would perform better on these architectures. Each of these layers is I/O oriented and runs as an independent process, so each core can handle a layer efficiently without affecting the execution of the other cores. Synchronization among the layers is handled through the inter-process communication APIs provided by the operating system.
How fast do you want to start coding?
When it comes to the difficulty of multi-core programming, there are three levels. Each level trades ease of programming for performance. I’ll start with the easiest and most common one.
Conventional general-purpose multi-core processors are the easiest to program and to get parallel applications running on. You can still use the established programming models, PThreads or OpenMP, to write multi-threaded applications. You don’t have to study the underlying hardware; you only have to refresh your existing parallel programming knowledge. In addition, if you are developing coarse-grained parallel applications, you can run them as independent processes and use the synchronization APIs provided by the operating system. This programming model trades off two things: (1) limited scalability: you cannot easily create and manage a lot of threads (I’m talking about hundreds of threads) because of hardware and programming model limitations; (2) little room for architecture-specific optimization: cache management and thread scheduling are done on your behalf, which may not give the best performance for your application. You can spend a few days reviewing what you already know and be ready to produce parallel applications with these models.
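To make the shape of this first model concrete, here is a minimal sketch of coarse-grained parallelism: split the data into a few chunks, hand each chunk to a worker thread, and combine the partial results. It uses Python's standard `concurrent.futures` module rather than PThreads or OpenMP, and the names `partial_sum` and `parallel_sum_of_squares` are illustrative. Note that in CPython the global interpreter lock limits the speedup of CPU-bound threads, so treat this as a picture of the structure, not as a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Work done independently by one worker: sum of squares of its chunk."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    """Split data into roughly equal chunks, one per worker, then combine."""
    size = max(1, (len(data) + workers - 1) // workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # The pool creates, schedules, and joins the threads for us.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

data = list(range(1000))
total = parallel_sum_of_squares(data)
print(total)
```

The only synchronization point is the final `sum` over the partial results, which is what makes this kind of coarse-grained decomposition easy to get right.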
The next level is the new programming models recently built on top of GPGPUs, such as CUDA and OpenCL. These models combine conventional serial programming with the new kernel-based parallel programming paradigm. They allow you to write the program and the input initialization code serially, as before, and then offload the compute-intensive part to the GPU card. The kernel function is executed by many threads, each running in its own context. You should use the synchronization primitives provided by the framework for these offloaded threads. Of course, these platforms are meant for applications with fine-grained parallelism. The main advantages of this model are: (1) automatic management of many threads, hundreds of them; the framework creates, manages, and destroys the thread contexts transparently; (2) some of the architectural complexity is hidden; for example, you don’t have to manage cache or pipeline optimizations. The tradeoffs are: (1) ease of programming: these models are more difficult to program because your application or algorithm must be aware of the architectural heterogeneity, and you will have to understand some architectural aspects of the GPU card, such as the memory hierarchy (although these are not as complex as in the third model); (2) some performance is lost to the programming model’s abstractions. For example, you can create more threads than there are available cores, which may delay the overall execution if these threads synchronize through one or a few shared variables. I recommend this model if you have highly parallel problems and don’t want to deal with most of the architectural aspects. You may need a week or two to understand the model and the architecture before you start programming.
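The kernel-based model can be pictured without any GPU at all. The sketch below is not CUDA or OpenCL code; it is a plain Python emulation in which a hypothetical `launch` helper stands in for the framework and runs the kernel once per logical thread index (serially here, in parallel on a real GPU). The kernel name `saxpy_kernel` and the SAXPY computation are illustrative choices.

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """Kernel body: each logical thread computes exactly one output element."""
    # Guard needed because more threads than elements may be launched.
    if thread_id < len(x):
        out[thread_id] = a * x[thread_id] + y[thread_id]

def launch(kernel, num_threads, *args):
    """Stand-in for the framework: run the kernel once per thread index.
    A real GPU runtime would execute these invocations in parallel."""
    for thread_id in range(num_threads):
        kernel(thread_id, *args)

# Host side: ordinary serial code prepares the input ...
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * len(x)
out = [0.0] * len(x)

# ... then offloads the compute-intensive part as a kernel launch.
launch(saxpy_kernel, 8, 2.0, x, y, out)  # 8 logical threads, 5 elements
print(out)
```

The point of the design is that the kernel never loops over the data itself: the thread index selects the element, and the framework decides how many threads actually run at once.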
The third level is challenging, but it offers many points of distinction for the developer or researcher who selects it. I see the Cell Broadband Engine as the only processor in this category. You can still program your parallel application using PThreads or OpenMP, but these only create and terminate the threads. To synchronize and properly implement your algorithm, you will have to understand all the architectural aspects of the microprocessor, such as the DMA transfers used to manage memory inside the Cell processor. The main advantage of this model is great flexibility in utilizing the available resources to get the best performance. It is worth mentioning here that one Cell processor with only 9 cores provides around 50% of the peak performance of the NVIDIA GTS 285 equipped with 120 cores. The major tradeoff in this model is the difficulty of programming. You may need a few weeks to study the architecture and the programming model before you can start writing your code, and you will have to spend even more time optimizing your application to get the best possible performance.
How much would you like to pay for the hardware/software?
You cannot isolate the price from the performance you can get out of your processor. We can use the simple ratio of dollars per GFLOP. The table below tells you more about this.
Architecture | Cost | Theoretical Peak GFLOPS | Cost per GFLOP |
Intel Quad Core Q9650 | $379 | 48 | $7.9 |
AMD Phenom II X4 965 | $200 | 45 | $4.4 |
Power7 | Not Yet Released | | |
NVidia GeForce GTX 295 | $480 | 1788 | $0.27 |
ATI Radeon HD 5870 | $420 | 2160 | $0.19 |
Cell Processor (Inside PS3) | $350 | 160 | $2.19 |
Cell Processor (One Blade) | $10,000 | 204 | $49 |
Don't forget that a GPGPU must be hosted in a full machine, which adds to the total cost per GFLOP.
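The last column of the table is just price divided by peak GFLOPS, and the host-machine caveat can be folded into the same ratio. Here is a small sketch using a few rows from the table; the function name `cost_per_gflop` is illustrative, and the $500 host price is a made-up assumption, not a quoted figure.

```python
def cost_per_gflop(price_usd, peak_gflops, host_usd=0.0):
    """Dollars per theoretical peak GFLOP, optionally including a host machine."""
    return (price_usd + host_usd) / peak_gflops

# (price in USD, theoretical peak GFLOPS), taken from the table above.
parts = {
    "Intel Quad Core Q9650": (379, 48),
    "AMD Phenom II X4 965": (200, 45),
    "NVidia GeForce GTX 295": (480, 1788),
    "ATI Radeon HD 5870": (420, 2160),
    "Cell Processor (One Blade)": (10000, 204),
}

for name, (price, gflops) in parts.items():
    print(f"{name}: ${cost_per_gflop(price, gflops):.2f} per GFLOP")

# A GPU card needs a host machine; HOST_USD is a hypothetical price.
HOST_USD = 500
gtx_with_host = cost_per_gflop(480, 1788, host_usd=HOST_USD)
print(f"GTX 295 including host: ${gtx_with_host:.2f} per GFLOP")
```

Even with a host machine added under this assumption, the GPU stays well below the general-purpose CPUs in dollars per peak GFLOP, though peak figures say nothing about how much of that performance your application will actually reach.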
Can I use combinations of different architectures?
Yes, and you should be creative about it. For example, you can build your web application on top of a general-purpose multi-core processor and use a specialized architecture with many simple cores to handle parts of the workload. You can attach a GPU card to your machine and use it to search large datasets, or to sort and filter large search results. In this case you are combining heterogeneous architectures to get the best out of each. You can also do this at the network level by building a hybrid cluster in which each node handles a different part of the workload. For example, nodes that do I/O-intensive work should have powerful general-purpose multi-core processors with multi-threaded architectures, since these processors are very good at scheduling their threads and processes for I/O-intensive applications. You can then allocate nodes with excellent processing power to data filtering, sorting, or any other compute-intensive tasks of your application.