Monday, December 21, 2009

A New Platform

I moved my blog to a different platform.

Follow me: http://MohamedFAhmed.wordpress.com

It already has several new blog posts.

I'm no longer posting to this blog.

See you there :)

Saturday, December 5, 2009

Which multi-core processor will get the job done?

Multi-core processors are supposed to give us better performance; that is, after all, the sole reason they exist. But the question is: which one is best to get the job done?

Well, the definite answer to this question is: it depends!

From my experience with different architectures, the payback depends on what you really want from them. I'm not talking here about the end-user experience; I'm approaching this question from the developer's perspective, mainly for applications that are compute intensive, such as scientific applications, data-intensive applications, and discrete algorithms with considerably complex computations.

So, if the answer is: it depends, I think it depends on:
  • Which problems will your machine be working on?
  • How fast do you want to start writing code?
  • How much are you willing to pay for the hardware and software?
Of course, life is too complex to take what I write here as the only guidelines for making up your mind; I'm only pinpointing important factors for your decision. Also, please feel free to ask me questions at the end of this post if you feel that I'm missing other important factors. I will also write more related details in upcoming posts to help you. I'm assuming that you are new to multi-core programming and would like to get started and tap into new domains of programming.

Which problems will your machine be working on?
If you intend to solve problems that run in parallel over huge data sets, I would recommend simple multi-core processors such as GPGPUs or the Cell Broadband Engine. For example, one core inside the Cell Broadband Engine, the Synergistic Processing Element (SPE), can give you around 25.5 GFLOPS, while one Intel Xeon core gives you around 9.6 GFLOPS. Inside the Cell processor you have 8 such cores, totaling around 205 GFLOPS available to you. If you consider GPGPUs, you can get up to 1700 GFLOPS from one GPU card, compared to a total of 96 GFLOPS for the latest Core 2 Quad general-purpose processor from Intel. This category of applications and algorithms includes, but is not limited to: string matching and searching, sorting large data sets, FFT computations, data visualization, network traffic scanning, and artificial intelligence.

However, if you are developing applications that work independently and perform a lot of I/O, you would select general-purpose multi-core processors such as the Power7, Intel's quad-core processors, AMD's Phenom I & II, or Sun's SPARC T1 & T2 processors. These processors provide in some cases up to 32 concurrently running threads, combining both multi-core and multi-threaded architectures. For example, a web application that distributes its workload over three tiers (web server, business logic, and database) would perform better on these architectures. Each of these layers is I/O oriented; in addition, each one runs as an independent process. Each core in such a case can handle a layer efficiently without affecting the execution of the other cores, and synchronization among them is handled by the operating system through its inter-process communication APIs.

How fast do you want to start coding?
When it comes to the ease or difficulty of multi-core programming, there are three levels of difficulty, and each level trades difficulty against performance. I'll start with the easiest and most common one.
Conventional general-purpose multi-core processors are the easiest to program and to get parallel applications running on. You can still use the old programming models, with PThreads or OpenMP, to write multi-threaded applications. You don't have to study the underlying hardware; you only need to refresh your old parallel programming knowledge. In addition, if you are developing coarse-grained parallel applications, you can run them as independent processes and use the old synchronization APIs provided by the operating system. What this programming model trades off: (1) limited scalability; you cannot create and easily manage a lot of threads (I'm talking about hundreds of threads) because of hardware and programming model limitations. (2) You don't have much room to maneuver and do architecture-specific optimization; cache and thread scheduling are done on your behalf, which may not give the best performance for your application. So you can spend a few days reviewing your old knowledge and you will be ready to produce parallel applications with these models, as in the sketch below.
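Here is a minimal sketch of this conventional model, assuming a plain C toolchain with OpenMP support (compile with something like gcc -fopenmp); the array size and contents are arbitrary illustration values:

    /* OpenMP parallel reduction: the runtime spawns roughly one thread per
       core, and no knowledge of the cache or interconnect is required. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)
            a[i] = 1.0;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

The same loop written with raw PThreads would need explicit thread creation and manual work splitting, which is exactly the extra effort this first level hides from you.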

The next level is the new programming models recently built on top of GPGPUs, such as CUDA and OpenCL. These models combine the conventional serial programming model with a new kernel-based parallel programming paradigm. They allow you to write the program and input-initialization code in a serial fashion, as you used to, and then offload the compute-intensive part to the GPU card. The kernel function gets executed by many threads, each running in its own context, and you should use the synchronization primitives provided by the framework for these offloaded threads. Of course, these platforms are for applications with fine-grained parallelism. The main advantages this model may give you: (1) automatic management of many threads, hundreds of them; the framework creates, manages, and destroys the thread contexts transparently. (2) It hides some of the architectural complexities; for example, you don't have to manage the cache or pipeline optimizations. The tradeoffs of these models: (1) ease of programming; it is more difficult to program with these models, since your application or algorithm must be aware of the architectural heterogeneity, and you will have to understand some architectural aspects of the GPU card, such as the memory hierarchy (though these aspects are not as complex as in the third model). (2) Some performance is lost to the programming model abstractions. For example, you can create more threads than the number of available cores; this may delay the overall execution if these threads synchronize through one or a few shared variables. I recommend this model if you have highly parallel problems and don't want to focus on most of the architectural aspects. You may need a week or two to understand the model and the architecture before you start programming; the sketch below shows the flavor of this model.
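As an illustration of this offload style, here is a hedged sketch using the OpenCL C host API and a tiny kernel string. It assumes an OpenCL runtime and a GPU are installed; error checking is omitted, and the vector size and scale factor are arbitrary illustration values:

    /* Serial host code plus one small kernel offloaded to the GPU. */
    #include <CL/cl.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void scale(__global float *x, float f) {"
        "    size_t i = get_global_id(0);"
        "    x[i] = x[i] * f;"
        "}";

    int main(void)
    {
        enum { N = 1024 };
        float data[N];
        for (int i = 0; i < N; i++) data[i] = (float)i;

        cl_platform_id plat;  cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        float factor = 2.0f;
        size_t global = N;
        clSetKernelArg(k, 0, sizeof(buf), &buf);
        clSetKernelArg(k, 1, sizeof(factor), &factor);

        /* One lightweight thread per element; the framework creates,
           schedules, and retires them transparently. */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

        printf("data[10] = %f\n", data[10]);
        return 0;
    }

The host part stays serial; only the small kernel string runs on the GPU, which is exactly the split between serial initialization and offloaded computation described above.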

The third level is challenging but provides many points of distinction for you as a developer or researcher who selects this programming model. I see the Cell Broadband Engine as the only processor in this category. You can still use PThreads or OpenMP, but only to create and kill the threads; to synchronize and properly implement your algorithm you will have to understand all the architectural aspects of the microprocessor, such as DMA-based cache management inside the Cell processor. The main advantage this model provides is great flexibility in utilizing the available resources to get the best performance. It is worth mentioning here that one Cell processor with only 9 cores provides around 50% of the peak performance of the NVIDIA GTS 285 equipped with 120 cores. The major tradeoff in this model is the difficulty of programming. You may need a few weeks to study the architecture and the programming model before you can start writing your code, and you will have to spend even more time optimizing your application to get the best possible performance. A hedged sketch of this explicit DMA style follows.
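For a taste of what "understanding the architecture" means here, below is a sketch of SPE-side double buffering using the MFC intrinsics from spu_mfcio.h. It is only a fragment, not a full program: it needs the Cell SDK toolchain (spu-gcc) and a PPE-side host, the effective address and chunk size are illustrative assumptions, and the trivial process() body stands in for real computation:

    /* Double buffering on an SPE: fetch the next chunk with an asynchronous
       DMA (mfc_get) while computing on the current one, then wait only on
       the current buffer's tag before touching its data. */
    #include <spu_mfcio.h>

    #define CHUNK 4096

    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    static void process(volatile char *p, int n)
    {
        (void)p; (void)n;               /* real computation would go here */
    }

    void stream_in(unsigned long long ea_in, int chunks)
    {
        int cur = 0;

        mfc_get(buf[cur], ea_in, CHUNK, cur, 0, 0);   /* start first transfer */

        for (int i = 0; i < chunks; i++) {
            int next = cur ^ 1;

            if (i + 1 < chunks)                        /* prefetch next chunk */
                mfc_get(buf[next],
                        ea_in + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, next, 0, 0);

            mfc_write_tag_mask(1 << cur);              /* wait on current tag */
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);

            cur = next;
        }
    }

This hand-tuned overlap of transfer and computation is what the first two levels do for you automatically; here it is your job, and also your opportunity.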


How much you would like to pay for the hardware and software?
You cannot isolate the price from the performance you can get out of your processor. We can simply use the ratio of dollars per GFLOP; the table below tells you more about this.

Architecture                      Cost        Theoretical Peak GFLOPs    Cost of 1 GFLOP
Intel Quad Core Q9650             $379        48                         $7.9
AMD Phenom II X4 965              $200        45                         $4.4
Power7                            Not yet released
NVIDIA GeForce GTX 295            $480        1788                       $0.27
ATI Radeon HD 5870                $420        2160                       $0.19
Cell processor (inside a PS3)     $350        160                        $5.83
Cell processor (one blade)        $10,000     204                        $49

Don't forget that GPGPUs must be hosted by a full machine, which adds to the total cost per GFLOP.

Can I use combinations of different architectures?
Yes, and you should be creative about that. For example, you can build your web application on top of a general-purpose multi-core processor and use a specialized, simple multi-core architecture to do parts of the workload. You can attach a GPU card to your machine and use it to search large datasets, or to sort and filter large search results. In this case you are combining heterogeneous architectures to get the best out of each. You can also do this at the network level by building a hybrid cluster in which each node handles a different part of the workload. For example, nodes that do intensive I/O should have powerful general-purpose multi-core processors with multi-threaded architectures; these processors are very good at scheduling their threads and processes for I/O-intensive applications. And you can allocate nodes with excellent processing power to do data filtering, sorting, or any other compute-intensive tasks of your application.




Thursday, November 26, 2009

End of the Cell Broadband Engine?!

Yesterday InternetNews.com published a piece of news about the end of the Cell Broadband Engine. David Turek, the VP of deep computing at IBM, said during an interview with the German site Heise Online that the PowerXCell 8i will be the last of the Cell line, and that IBM will be focusing on the Power7 processor, which is due mid-2010.

In this news article, Jon Peddie mentions that the Cell processor had shortcomings that became apparent, such as the lack of direct access to global memory by its computing engines (the SPEs), and he wrongly states that everything has to go through the PowerPC core, which would create a bottleneck. This is technically not true: the PowerPC core does not handle the requests initiated by the other compute engines (the SPEs). Also, some researchers note that its cache should be bigger, yet its performance is still considered by many to be the best among multi-core processors in its category. In addition, the Cell processor taught a lot of developers and researchers the best parallel programming practices for multi-core processors. The fact that everything is controlled by the developer forced its programmers to think harder about the best ways to optimize their algorithms' execution time.

Although I can somewhat believe that IBM may make changes to the Cell processor, it is very difficult to believe that IBM will end its Cell line that soon. IBM invested a lot of money and time, and many of its customers invested heavily in adopting the Cell processor.

I think IBM is trying to produce its own line, away from Sony and Toshiba, without giving away $500 million worth of investment and five years of engineering. It is about business. The Cell processor is one of the masterpieces among multi-core processors. And as David Turek mentioned, the future belongs to hybrid multi-core processors, for a very simple reason: they provide a great ratio of processing speed to consumed power.

I think IBM will reuse the SPE instruction set along with their traditional PowerPC architecture, but the change might be in how the cache is organized and managed. Also, I think IBM is rethinking the cores' interconnect network. They may use either dynamic networks or a mix between an on-chip network and a shared-cache architecture.

Friday, November 13, 2009

Nano-kernels for the Era of Exascale Computing

I'm talking here again about multi-core processors for massively parallel systems working on complex scientific applications. However, I'm tackling this area from a different perspective. I would like to think with you about how multi-core processors will look five years from now, and to consider these questions: What are the serious problems these processors will suffer from (from a system's perspective)? Which of the current solutions or anticipated frameworks may help us solve these problems? I'll be discussing only one of them here. It is very difficult to predict accurately what technology advancements will take place in the coming five years. However, there are general trends that we can track, and we can reasonably predict their future effects.

Anyway, I mentioned before that multi-core processors will have thousands of cores, maybe even tens of thousands (check the latest AMD GPGPU, the Radeon HD 5970). These cores will be very simple, all solving the same problem but on different data chunks. This of course mandates the existence of shared resources and shared areas that hold the input data and store the results. The path from a core to data storage and other shared resources will get more complex and will involve more shared resources, such as more hierarchies of on-chip and off-chip caches, core interconnections, I/O buses, etc. The anticipated path and hierarchies through which data travels to reach the system's main memory or a core's registers will have a very important effect on data movement latency. It is not only about having higher latency, which can be hidden by many verified techniques; it is about the variance of this latency from one core to another and from one request to another inside the same core. Current software solutions, such as prefetching into multiple buffers, depend on the assumption that the latency to move data from memory to all of the processor's cores is the same at run time. However, this is not true even in current multi-core processors. For example, inside the Cell Broadband Engine, the DMA (memory) latency differs from one core to another depending on its physical location inside the chip and how far it is from the memory controller. This variance will grow as these processors gain more cores and as contention increases on the shared resources inside them. Such variance requires solutions that hide memory latency dynamically at run time inside each core, based on that core's specific data latency.
Some hardware-based solutions try to solve this problem through hyper- or multi-threading. Inside multi-core processors with multi-threading, once a thread blocks on I/O or data movement, another thread becomes active and resumes execution. Sun Microsystems, through its latest UltraSPARC T2 & T2 Plus, added up to eight threads per core, which gives in the end a large number of virtually concurrent threads on the same chip. However, there are two important drawbacks. First, if memory latency happens to be low, these threads will spend most of their time switching, which gives at the end a semi-serial performance because the threads within a core share the same ALU and FP units. On the other hand, if memory latency is really high for all of the working threads inside the same core, we may end up with idle time because all of them will be waiting for data to come from the system's memory or an I/O device. Second, this solution adds complexity to the hardware and consumes space that could be used for a bigger cache or even more single-threaded cores.



Nano-Kernels
OK, what would the solution be, then? If we could dynamically create a threading framework that creates and manages threads much like hyper- or multi-threaded architectures do, we might be able to solve the data latency problem smartly, even for massively parallel multi-core processors. Since each core has its own data latency, why don't we create small software threads that switch their context into the core's local cache instead of switching it out to the system's main memory or second-level cache? The context in this case would be the core's registers (pretty much as in current hardware-based multi-threaded architectures) plus a few control registers affecting the execution of the thread, such as the program counter. Whenever one of these small threads, let's call them micro-threads, stalls waiting for a data chunk to be copied from the system's main memory, it goes into sleep mode and another micro-thread is switched to running mode and resumes execution. A very small and very fast kernel, call it a nano-kernel, should actively run inside each of these cores to schedule micro-threads and make sure that data movement latency is hidden almost completely inside each core. This idea of having micro-threads has two advantages. First, the number of micro-threads is dynamic: it depends on the data movement latency. For example, under large data movement latency we may add more micro-threads per core so that work continues while other micro-threads wait for their data to arrive in the core's cache. Second, context switching inside each core's cache is a very cheap and very fast process, i.e. a few nanoseconds. Of course, saving contexts consumes part of each core's cache, but hardware-based multi-threaded architectures already spend comparable resources to hold their thread contexts. Also, this idea requires specific facilities from the ISA; for example, manual cache management and an internal interrupting facility inside each core are mandatory for it to work.
So, if we create nano-kernels doing this optimization inside each core, we would reach new performance ceilings. The approach is scalable, since each core has its own nano-kernel working independently and scheduling micro-threads based on the resources given to that core. So, with tens of thousands of threads this solution would still work and get the most out of the expected massively parallel multi-core processors. A small illustrative sketch of such a nano-kernel loop follows.
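This is a purely illustrative sketch in plain C: the micro-threads here are simple state machines, and dma_issue()/dma_pending() are hypothetical placeholders standing in for a real asynchronous transfer engine; nothing below is tied to a specific processor.

    /* A nano-kernel loop switching cooperative micro-threads: whenever a
       micro-thread issues a transfer it yields, and the scheduler runs
       whichever micro-thread already has its data ready. */
    #include <stdio.h>

    enum state { READY, WAITING, DONE };

    typedef struct micro_thread {
        enum state st;
        int step;               /* simple program counter for the body */
        int id;
    } micro_thread;

    /* Placeholder: pretend every transfer completes after one pass. */
    static int  dma_pending(micro_thread *t) { (void)t; return 0; }
    static void dma_issue(micro_thread *t)   { t->st = WAITING; }

    /* Body of one micro-thread: compute on a chunk, prefetch, yield. */
    static void run_slice(micro_thread *t)
    {
        if (t->step < 3) {
            printf("micro-thread %d: computing on chunk %d\n", t->id, t->step);
            dma_issue(t);                 /* request next chunk, then yield */
            t->step++;
        } else {
            t->st = DONE;
        }
    }

    int main(void)
    {
        micro_thread pool[4];
        int live = 4;
        for (int i = 0; i < 4; i++) pool[i] = (micro_thread){ READY, 0, i };

        /* The nano-kernel: overlap transfer latency with computation by
           always switching to a micro-thread whose data has arrived. */
        while (live > 0) {
            for (int i = 0; i < 4; i++) {
                micro_thread *t = &pool[i];
                if (t->st == WAITING && !dma_pending(t)) t->st = READY;
                if (t->st == READY) {
                    run_slice(t);
                    if (t->st == DONE) live--;
                }
            }
        }
        return 0;
    }

In a real implementation the contexts would live in the core's local store and the pending check would poll DMA tag completion, so switching away from a stalled micro-thread costs only a few nanoseconds.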

Wednesday, November 11, 2009

Follow Me

You can now follow me on Twitter at this address:

http://www.twitter.com/MohamedFAhmed

Hopefully I'll be able to do it often enough!

Cheers

Wednesday, November 4, 2009

Challenges of Multi- and Many-Core Microprocessors

Multi- and many-core processors are here to stay for a really long time. They are the microprocessor manufacturers' response to the uni-core scalability walls. Although software communities explored parallel programming models heavily in the 80s and 90s, those efforts were directed at less finely grained systems, mainly clusters, parallel machines, and Symmetric Multi-Processors (SMPs). Multi- and many-core architectures brought some classical problems back to the surface, such as memory latency, data synchronization, and thread management. They also introduced the new problems of massively parallel systems with largely fine-grained threading models, such as managing thousands of concurrent threads and inter-thread communication and data sharing. In this post I would like to pinpoint some of these challenges from my research and programming experience on multi-core architectures.

Maintaining the current rate of increase in processing power requires microprocessor designers to introduce more processing cores per microprocessor. However, two important aspects must be considered as more cores are introduced. First, the number of processor cores will increase at a high rate to double the speed every 18 months and keep Moore's law in effect. Hence, it is expected that within five years we will have many-core processors with tens or even hundreds of processing cores on the same chip. Second, as the number of cores increases, they will have simpler designs and achieve simpler tasks, and each core will be faster than current single-core processors. Power and heat management issues will impose such design constraints on microprocessor manufacturers. Such design aspects will increase overall processor speed while maintaining reasonable power consumption and heat dissipation. As a result, parallelism will become finer grained. Developers will parallelize their applications at a finer level to take full advantage of the multi- or many-core advancements. This granularity will increase the contention among threads on shared resources; these resources can be a memory location, i.e. data, or an I/O device. Interdependencies among threads will increase. In addition, as cores get simpler and faster, more data will move back and forth between the processor and the system's main memory. On the other side, the memory latency ratio keeps increasing. Hardware-based techniques to hide this latency, such as branch prediction and embedded cache replacement algorithms, may not be efficient enough to hide it for parallel applications. Software-based cache management and execution scheduling are now vital to fully utilize multi-core processors. Finally, the programming complexity of multi-core processors and the inherent complexity of parallel applications require tools that reduce some of these complexities.


Memory Latency Wall

As processors and programs become more parallel, they become more data hungry. On the other hand, the number of processor cycles needed to access the system's main memory grew from a few cycles in 1980 to almost a thousand cycles today. Moreover, the cache-per-core ratio will continue to go down, which will make the memory latency problem worse if the cache is not managed properly. Although there is great potential in DRAM-based memory to increase performance, the growth rate of processors' aggregate cycles will continue to be faster; the processor-to-memory performance gap is expected to grow by 50% per year according to some estimates. The good news is that the memory latency problem can be addressed using efficient software-based scheduling of memory accesses. Multi-core processors are now returning some control back to the software developer to manage each core's cache. Such explicit cache management capabilities give programmers more room to maneuver around the processor-to-memory performance gap; a small example is sketched below.
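As one small illustration of software taking over latency hiding on a conventional core, here is a hedged sketch using GCC's __builtin_prefetch; the prefetch distance of 16 elements and the array size are arbitrary values that would have to be tuned per architecture:

    /* Software prefetching: ask for a[i + 16] while summing a[i], so the
       load for a later iteration overlaps with the current computation. */
    #include <stdio.h>
    #include <stddef.h>

    #define N (1 << 20)

    static double sum_with_prefetch(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low locality */
            sum += a[i];
        }
        return sum;
    }

    int main(void)
    {
        static double a[N];
        for (size_t i = 0; i < N; i++)
            a[i] = 1.0;
        printf("sum = %.0f\n", sum_with_prefetch(a, N));
        return 0;
    }

On an architecture such as the Cell BE the same idea becomes explicit DMA into the local store rather than a hint to the hardware.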

Data Synchronization

Using current synchronization mechanisms to synchronize the access of tens or hundreds of threads to one resource may put the application's performance on the line. The whole system or application may suffer from deadlock or starvation due to weak synchronization mechanisms. In the worst cases, current synchronization mechanisms serialize the application in the areas that need access to shared resources. As the number of hardware threads in multi-core processors and the parallelism in applications increase, the resulting performance loss increases as well. For example, implementing a parallel shared-counter algorithm requires each participating thread to lock the counter before incrementing it; in the worst case, each thread waits for n-1 threads before it can update the counter, where n is the number of threads. If these threads run on cores on the same die, the efficiency of data synchronization can be greatly improved by communicating data through the available on-chip facilities, such as the cores' interconnect, shared cache, etc. Current synchronization techniques use the system's main memory to write and read shared data, which makes things even worse: the memory latency delay is added on top of the delay of the synchronization algorithm itself. The sketch below contrasts a lock-based counter with a lock-free one.
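Below is a hedged sketch of the shared-counter example, assuming POSIX threads and C11 atomics; the thread and iteration counts are arbitrary illustration values (compile with something like gcc -pthread):

    /* Shared counter incremented by several threads, once behind a mutex
       and once with a single atomic instruction. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define THREADS 8
    #define ITERS   100000

    static long counter_locked = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static atomic_long counter_atomic = 0;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);       /* contended: threads queue here */
            counter_locked++;
            pthread_mutex_unlock(&lock);

            atomic_fetch_add(&counter_atomic, 1);   /* lock-free alternative */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[THREADS];
        for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
        printf("locked = %ld, atomic = %ld\n", counter_locked,
               atomic_load(&counter_atomic));
        return 0;
    }

The mutex version serializes every increment behind one lock, while the atomic version still contends on the cache line holding the counter but avoids the lock round trip; on-chip facilities such as a shared cache make both forms cheaper than bouncing the shared data through main memory.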

Programming Complexity

Parallel computing is inherently complex, mainly due to the difficulty of design and the intricacy of resource sharing and synchronization. The presence of multi-core processors at different scales, from embedded systems to supercomputers, has made adapting applications to these new hardware platforms a critical issue. As multi-core processors gain more cores and parallelism becomes more fine-grained, complexity will increase as well. Instead of designing a parallel application with 10 or 20 concurrent threads, an application may be executing hundreds or thousands of threads on the same machine. A solution is required in this case to help reduce programming complexity while providing excellent scaling for the number of working threads.

Actually, these challenges are the main inspirational pillars for most multi-core researchers, architects, and developers. All microprocessor manufacturers are after faster processors without increasing programming complexity and without losing developers' ability to make the best use of their new architectures. That's why microprocessor manufacturers are now aggressively involved in the programming models. Intel, for example, created Intel Parallel Studio (with the Ct framework) for its general-purpose multi-core microprocessors and for specialized ones as well, such as the Larrabee GPGPU. Also, ATI built the ATI Stream framework and NVIDIA built the CUDA framework to help developers make the best out of these new microprocessors without getting into the nitty-gritty architectural details of these advanced GPGPUs.

Monday, November 2, 2009

Multi-Core Programming Experiences



In my previous post I tried to characterize multi-core processors and quickly pinpoint their major distinguishing features. In this post I would like to briefly share with you some of the untapped aspects of the Cell Broadband Engine and the GeForce GTX 295.


The Cell Broadband Engine (CBE)
The Cell microprocessor is an innovative heterogeneous multi-core processor built by IBM, Sony, and Toshiba. 400 designers worked closely together for five years to invent the new heart of the PlayStation 3 (PS3). The PS3 was only a starting point for the CBE; IBM made the first stab at re-introducing the CBE as a highly capable microprocessor for compute-intensive jobs.
I won't repeat the explanation of the CBE architecture; you can find it on Wikipedia, IBM's website, and many other places if you Google it. I implemented several discrete algorithms on it, such as graph traversals and integer sorting, and scientific algorithms such as the FFT. I was also able to get into its architectural details and build a proprietary threading model, called micro-threading. This framework hides memory latency in most workloads without engaging the developer in the lowest architectural details of the Cell processor. I also ran a series of experiments to characterize the effect of its architectural properties on memory latency patterns in different workloads.
The beauty of the CBE, from my point of view, is the great extent of control it gives you to reach the best execution time. All the other architectural properties regarding its compute capabilities, multiple levels of parallelism, and its on-chip network are discussed in many publications, and I experienced all these great features during my PhD journey. However, I would like to note the following from some of my experiences.
Although the Element Interconnect Bus (EIB) has high performance and very low latency, the effect of this topology on core performance is overlooked by most research efforts. For example, from my memory latency measurement experiments I found that memory latency differs from one core to another depending on how far the core is from the memory controller. This does not reflect a flaw in the design, but it is a property that changes the measured memory latency per core. As this topology is used more and more cores are connected to the same ring, the physical-location effect will matter even more. The question is: how does this affect my performance as long as I can use techniques such as multi-buffering and data prefetching? The answer is very simple: as long as memory latency differs from one core to another, you need to hide it according to each core's latency measures. For example, on cores with relatively high memory latency you can use more prefetch buffers than on cores with lower memory latency. I already discussed this in my paper on optimum micro-threading scheduling; please have a quick look at it to understand this issue better.
Also, the new implementation of the CBE can work with larger RAM. However, this RAM is based on DDR3 technology; the initial implementation using XDR RAM had the limitation of a maximum memory of 1 GB. Although the new CBE implementation has a faster floating-point unit and two memory controllers to keep the processor-memory bandwidth high, memory latency is getting higher due to the higher latency of DDR3 RAM compared to the XDR implementation.


NVIDIA GeForce GPGPU
I also worked on the NVIDIA GeForce GTX 295 to implement some information retrieval algorithms. It is still work in progress and yet to be published. However, from my insights into the Cell Broadband Engine, I figured out some architectural properties that are worth sharing with you.
NVIDIA's GPU programming framework abstracts, to a great extent, the architectural details of the microprocessor. From a productivity point of view this is a great feature, but it leaves few options for researchers and experienced developers to explore. For example, I couldn't find a clear way to schedule threads onto the processor's cores; it seems to be done (I'm not sure yet) by the processor's scheduler, following the same policy used by Intel's hyper-threading and Sun's multi-threading architectures. I'm now measuring whether memory latency differs from one core to another. My initial measurements show that memory latency is almost the same across all cores. However, I'm a little concerned about the relatively many hierarchies built into the processor. I realize that the shared memory model mandates such a hierarchy to keep synchronization overhead reasonable. However, adding hierarchies to run away from this problem is not the best solution. NVIDIA is still investing in this hierarchical approach with their new GPU processor, Fermi.


Although multi-core processors provide a smart escape from the physical limitations of uni-core processors, they need thorough architectural analysis to best utilize their resources. I think this is attainable by monitoring different execution patterns and pinpointing bottlenecks. This should make it easier to build efficient programming models, run-time systems, and algorithms for multi- and many-core microprocessors. For example, inside the Cell Broadband Engine, manual cache management and the Element Interconnect Bus (EIB) created the opportunity to build run-time systems that get the best performance while simplifying the programming model, such as micro-threading, the MPI microtask model, and data prefetching. I think multi- and many-core processors will evolve through a closed feedback loop, see below.






Whenever a new architecture is introduced, developers and researchers start implementing different algorithms and applications to get the best out of it. However, performance bottlenecks pop to the surface very soon. Several efforts then try to solve these bottlenecks, either by tweaking the implemented algorithms or by building general frameworks that identify performance-degrading parameters at run time and change them. On the other hand, the loop is properly closed if microprocessor designers listen to developers' notes and try to remove these bottlenecks through architectural enhancements. This, I think, was properly handled in the new Fermi architecture of NVIDIA's GPGPUs. For example, Fermi now has multiple memory controllers, each handling the requests of a different group of cores. This should reduce the effect of memory request serialization, which is a serious performance bottleneck.

I believe research teams are now moving from the naive ways of speeding up multi- and many-core processors, the old tricks of algorithmic enhancements, to digging deeper into the processors' architectural features and proposing better programming models backed by run-time libraries and frameworks. This trend is blending compilers, operating systems, parallel programming, and microprocessor architecture together.


Thursday, October 22, 2009

Be the first one there!

If you want to be in the leadership position, you have to look into the future. Your eye should be directed towards where the others would like to be, and then you should set your objective to be there before them. If a nation would like to be in the leadership position, the whole nation must be in the race to that position before any other nation. I think it is not about having the latest technology or following certain procedures made by others; it is about doing your best to reach that future point in a shorter time.

At the personal level, if you have that goal engraved strongly enough in your heart and have planned thoroughly to reach your objective, you will develop all the tools, attitude, and knowledge that will keep your eye looking into the future.
Having the ability and vision for the future requires you to understand what is happening around you right now. Understand the latest in your field, talk to the experts and get to know them, and get involved with the corresponding communities.

Sunday, October 18, 2009

Characterizing Multi- and Many-Core Microprocessors

The emergence of multi-core processors is changing the face of computing forever. Concurrency and parallelism in program execution will be the dominant speedup factor in both microprocessor design and application programming. Multi-core processors are now in all kinds of devices with compute capabilities: embedded systems such as cell phones and TVs powered by Cell and ARM processors, laptops and desktops powered by Intel and AMD multi-core processors, and supercomputers such as the Roadrunner, which runs thousands of processors, each with 4 to 9 cores on the same chip. In addition, multi-core processors will not stay as we experience them nowadays; they will have tens, hundreds, or even thousands of cores working concurrently on the same chip. Microprocessor designers have already hit the physical walls of speeding up single-core processors through higher frequencies, deeper pipelines, or more cache. To keep catching up with Moore's law, adding more cores is now the key to doubling a processor's computing speed every 18 months. So, expect to have more than 10 cores per processor two years from now.

I would like to take you through a series of posts in the coming few weeks to discuss the general aspects of multi/many-core processors. I would like to start by characterizing multi-core processors and going through the general architectural features that they have in common. This might help in understanding where multi-core processors may go. I would also like to quickly discuss three critical problems that many researchers in computer science and engineering are actively working on, along with possible solutions, so that you can direct your attention to the useful trends in such an exciting technology.


Please consider that what I write in my blog is a result of my research and my tracking of different technologies in the market and academia right now. These blog posts may become obsolete quickly depending on how fast technology changes. I will have done a good job if my expectations turn out to be valid three or five years from now!

This week, I’ll discuss the general aspects of multi-core processors. I’ll try to explain the anatomy of multi-core processors based on their short history and current developments.

So, let's get started...

Characterizing Multi-Core Processors

Although all designers and microprocessor manufacturers realize that multi-core architecture is the best way nowadays to keep improving microprocessor performance, they do not all agree on how multi-core processors should be designed and utilized. The differences come from where these processors will be used and also from the many possible ways to build such an architecture. Important architectural decisions are now more complex, such as instruction sets, core interconnections, cache sharing and size, etc. This complexity creates more possibilities and more tradeoffs in microprocessor designs. I would like this week to pinpoint the major architectural decisions and their possible effects on the overall processor's performance and fitness for purpose.
I think the design decisions for multi-core designers fall in these areas: homogeneity of the processor's cores, cache sharing, cache size and hierarchy, cache management and control, embedded threading, and core interconnection.

Homogeneity and Complexity of Cores
The first intuition for multi-core processors is to have homogeneous cores inside the same processor. However, this is not always the best design for a multi-core processor; it depends on the domain that the new processor is targeting. If it is serving users with desktop-sized machines, the main requirement is a faster processor able to serve multiple applications at the same time. In such a case you may want homogeneous cores serving each application independently. This is what you typically find in Intel, AMD, and Sun Microsystems' current multi-core processors. They would like to keep backward compatibility and achieve good overall machine speedup. Although they provide better scalability than single-core processors, they are more like an SMP on a chip: the cores work independently most of the time, and the synchronization mechanisms among them are still limited. For example, Intel's dual-core and quad-core processors are connected using a simple multiplexing mechanism, crossbar interconnects, which assumes that these cores are not likely to communicate and exchange data most of the time. Also note that these cores inherit the complexity and deep pipelines of the single-core architectures. I think these companies wanted to get their multi-core processors to market quickly by reusing the old designs on a single chip with the new, smaller manufacturing technology. When you take a closer look at the performance of each application, you do not feel any significant improvement. So with this type of homogeneous but older-core design you get a performance improvement by running more applications concurrently, but almost the same speed per application. The advantage of such a design decision is being able to run all applications designed for single-core processors, unchanged, on these multi-core processors. It is also easier for operating system designers to adapt to these changes, since they can treat these multi-core processors as SMP (Symmetric Multiprocessor) systems. Of course, they will improve process scheduling and inter-process communication, but it is the same overall mechanism.

The other alternative is a homogeneous multi-core processor with simpler core designs and possibly a different instruction set. For example, most GPUs, such as NVIDIA's GeForce GTX 295, have many cores, but they are all identical and simpler in design. Intel is also launching the Larrabee microprocessor, which uses some of the older x86 instruction set but is simplified in design, for example with fewer pipeline stages. These processors bring great performance improvement. The simplicity of the cores makes each core execute code faster, and it also leaves more space to put more cores on the die, which boosts the overall processor's performance. However, these microprocessors target performance improvement for applications designed to run on parallel architectures or with a multi-threaded software design. This basic requirement makes programming these processors more complex, since the programmer must understand the architecture and the best way to utilize the huge processing power available on the chip. So, these processors are designed to improve performance for a single parallel application optimized to run smaller but more numerous threads. I'll discuss the programming complexity later on. In most cases operating systems do not take full control over such processors; these homogeneous, simple-core microprocessors are only managed at the macro level, i.e. tasks are created or deallocated. Operating systems do not, so far, handle thread or resource scheduling on these processors; that is done by run-time libraries. These processors will give you tremendous performance improvement. For example, the observed performance of NVIDIA's GeForce GTX 295 is around 385 GFLOPS. To give a notion of how fast they are compared to traditional processors, a Xeon processor can give you up to 9 GFLOPS. They are now about the same price!
Heterogeneous multi-core processors provide a different way to design a highly performing multi-core architecture. Existing embedded multiprocessors, such as the Intel IXP network processing family, keep at least one general-purpose processor on the die to support various housekeeping functions, provide the hardware base for more general operating system support, and execute the serial part of parallel applications. Similarly, IBM's Cell BE processor has one general-purpose core (the PPE) and eight tailored processing elements (the SPEs). Keeping a larger processor on chip may help accelerate "inherently sequential" code segments or workloads with fewer threads.
There are many comparative points worth discussing between homogeneous and heterogeneous multi-core processors. Under Gustafson's scaled speedup or Amdahl's speedup laws, in most parallel problems the serial part stays constant even as the problem size increases, so a proper speedup is reached as the ratio of the parallel part to the serial part of the implemented algorithm increases. Such a perspective allows multi-core processors to reach a nice, near-linear speedup as the number of cores used in the parallel portions of the problem increases. Heterogeneous multi-core processors may reach better speedup than homogeneous ones for two main reasons. First, the availability of one faster, more powerful core performing the serial portion makes it easy to freeze or shorten the serial part's execution time. Second, the other cores can be simpler and faster to perform the parallel portion in a shorter time, and increasing the number of simple cores diminishes the effect of the serial part. Hence, according to Gustafson's scaled speedup law, if we have a single powerful core handling the serial portion of the program that is four times faster than the other cores, we may get a speedup equal to:

S(P) = P - α(P - 1)/4

where α is the serial fraction of the algorithm. Hence, as P (the number of cores used in the parallel part) increases, and as the relative power of the core performing the serial part increases, the effect of the serial part diminishes.

Using Amdahl's law, the speedup of heterogeneous multi-core processors is even better. The comparative speedups of a homogeneous 100-simple-processor design and a heterogeneous 91-processor design, relative to a single simple processor, are:

Speedup (homogeneous) = 1 / (0.1 + 0.9/100) = 9.2 times faster

Speedup (heterogeneous) = 1 / (0.1/2 + 0.9/90) = 16.7 times faster

This assumes that 10% of the time a program gets no speedup on a 100-processor computer. We also assume that, to run the sequential code twice as fast, a single processor would need 10 times the resources of a simple core, due to a bigger power budget, larger caches, a bigger multiplier, and so on. A quick check of the arithmetic is sketched below.
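The figures above can be double-checked with a few lines of C; the 0.1 serial fraction, the 100 simple cores, and the twice-as-fast serial core are exactly the assumptions just stated:

    /* Amdahl's-law check for the homogeneous and heterogeneous designs. */
    #include <stdio.h>

    int main(void)
    {
        double serial = 0.1, parallel = 0.9;
        double homogeneous   = 1.0 / (serial + parallel / 100.0);
        double heterogeneous = 1.0 / (serial / 2.0 + parallel / 90.0);
        printf("homogeneous:   %.1fx\n", homogeneous);    /* ~9.2  */
        printf("heterogeneous: %.1fx\n", heterogeneous);  /* ~16.7 */
        return 0;
    }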

In addition, heterogeneous processor solutions can show significant advantages in power, delay, and area. Processor instruction-set configurability is one approach to realizing the benefits of processor heterogeneity while minimizing the cost of software development and silicon implementation, but it requires custom fabrication of each new design to realize the performance benefit, and this is only economically justifiable for large markets, such as the Cell BE architecture used to power the new generation of Sony's PlayStation 3.

On the other hand, a single replicated processing element has many advantages; in particular, it offers ease of silicon implementation and a regular software environment. Managing heterogeneity in an environment with thousands of threads may make a difficult problem impossible.

Homogeneous multi-core processor manufacturers are building on their serial processor architectures: they include on a single chip two or more cores with the same instruction set. These manufacturers release different versions of these chips, with variations in cache sharing, cache levels, embedded threading, and core interconnection.

However, when increasing the number of cores, given the physical limitations and the maximum number of transistors that can be placed on a single chip, the cores themselves will also differ from the ones we encounter right now. As long as parallelism needs to increase and the processing load is to be distributed across more cores, simpler, faster cores with fewer transistors each are required. So, in order to keep scaling, designers will have to simplify these cores. Of course, in that case not all cores will be identical; heterogeneous cores will be another necessity of the speedup requirements. Actually, some existing designs already support this model; the Cell Broadband Engine, for example, is one of the leading heterogeneous multi-core processors following such a path. Also, if we consider GPUs as part of the host machine, they are another form of heterogeneous multi-core computing.


Cache Sharing Models
Cache sharing among processor cores mainly drives their performance and communication patterns. Of course, cache sharing or separation also affects the number of transistors and overall power consumption, but here we will mainly discuss the performance and communication outlook. As shown in the figure below, there are two possible designs for cache allocation. In drawing (a) both processor cores use the same shared second-level (L2) cache. Distribution of the cache between processor cores can be done either voluntarily or hardcoded in each core's instruction set. In the first case the address space of the whole cache is accessible to both cores: if we have a processor with two cores, one of them can voluntarily release some of its cache space to the other core if that core is running a memory-intensive task. This design also provides the advantage of sharing and communicating data through the shared cache space. However, referring to the same example, two cores can also compete for the available cache resources. This results in a lot of cache misses for both cores, and each miss now costs a core several hundred execution cycles to retrieve data from the system's memory. The second possible design is a separate cache per core, as shown in drawing (b). Each core has its own second-level cache, totally unaffected by the other core's caching policies. Although each core is independent in its cache miss rate, the cores lose the shared cache as a data communication mechanism. Creating an independent cache for each core also introduces more circuits on the chip and imposes more power and heat overhead on the system.




The trend is not yet clear; different processors use different models of cache sharing. For example, in GPUs there is a relatively large cache hierarchy, around 3 levels of caching on the same die. In other processors, such as the Cell Broadband Engine, there is only one level of cache, and each core has its own cache managed independently. The disadvantage of this model is the limited cache that can be allocated to each core; the Cell processor, for example, has only 256 KB of cache for both data and instructions per core. In shared cache architectures, however, more cache can be allocated on the chip, and one core can possibly get it if other cores yield their share of the cache. It depends in this case on the cache allocation policy.


I think the payback of cache sharing or separation comes from the processor's usage pattern. If you are doing a lot of sharing and communication among running threads, a shared cache model would be best. However, shared cache, or the shared memory model, has diminishing returns as the number of threads, and consequently the contention on the cache, increases. You should be careful about this. GPUs ease this problem by having more hierarchies to allow different levels of data cache sharing; such an architectural decision should decrease contention. On the other hand, separate caches allow each core to work independently and manage its cache according to its current workload. However, they make sharing data more difficult, since sharing must be explicit in this case. Also, such an architectural decision adds more load on other shared resources, such as the cores' interconnect, to move data among the separate cache areas.

Cache Management and Control
Cache management and control is one of the architectural aspects that single-core processors abstracted away. Adding more cache and predicting branching in applications were enhanced in each generation of single-core processors. Explicit cache management and control was an optional feature, and in many programming models and tools cache management techniques were overlooked to keep the programming model simple and easy for programmers. However, multi-core processors are becoming more pervasive; they are now manufactured for both general and specialized computing models. Moreover, parallelizing applications on multi-core architectures is still in its early stages: neither processor designers nor massively parallel application designers know for sure the best cache management techniques. Therefore, processor manufacturers are giving more cache management and control capabilities back to programmers. No solid design patterns have been proven, nor frameworks implemented and tested, that reveal the best general or automated cache management techniques. It is still possible to use hardware-managed caches on processors with a few cores, i.e. two or four, as long as they are running general computing applications independently from each other; they can then run and operate much like Symmetric Multi-Processor (SMP) based machines. However, where there is application-specific interdependency between a processor's cores, it is better to provide explicit cache management mechanisms. Cell Broadband Engine (CBE) processors are the first generation of multi-core processors that give full per-core cache management and control to application developers. Although this makes the programming model harder than the conventional one backed by automated cache management, it gives application programmers a powerful tool to hide memory latency by queuing memory load and store requests and using double or quad buffering techniques.

Cache Size and Hierarchy
First-level (L1) and second-level (L2) caches were always used in single-core processors; that leveling of caching was mandatory to hide memory access latency. Many multi-core manufacturers still use the same hierarchical caching scheme to hide memory latency. However, multi-core processors now generate more contention on the system's main memory, with different access patterns to different memory locations. Although multiple levels of cache would enhance the sharing of data among cores, this may dramatically increase cache misses. On the other hand, if each core has its own independent cache hierarchy, this significantly increases the number of transistors and adds to the memory management complexity. Most multi-core processors with 4 cores or more use small caches and a shallow hierarchy. A reduced cache size leaves more space on the chip for more cores, and more cores means more overlapped memory accesses, which naturally hides latency. I think a shallow hierarchy reduces the cost of cache misses and simplifies cache management mechanisms. For example, the Cell BE processor has eight specialized small cores (SPEs); each has only 256 KB of first-level storage (called the Local Store, or LS) managed explicitly by the programmer using Direct Memory Access (DMA) commands. Such architecture enabled the Cell BE to reach a 25.5 GB/s memory-processor bandwidth. I'm still investigating, in the GPGPUs, the effect of having multiple hierarchies for different groups of cores inside the microprocessor.

Cores Interconnections
At the level of the physical hardware interconnect, multi-cores have initially employed buses or crossbar switches between the cores and cache banks, as shown in the figure below. However, such solutions do not scale to thousands of cores. We need on-chip topologies that scale close to linearly with system size, to prevent the complexity of the interconnect from dominating the cost of many-core systems. Scalable on-chip communication networks will borrow ideas from larger-scale packet-switched networks. Chip implementations such as IBM's Cell already employ multiple ring networks to interconnect the nine processors on the chip, and use software-managed memory to communicate between the cores rather than conventional cache-coherency protocols.


Again, it is a little confusing, because both models, using an on-chip network topology and communicating shared data implicitly through the cache, are still developing for multi-core processors, and each is delivering promising performance. In GPGPUs, for example, shared cache is the only way to link cores together. In other processors, such as IBM's Cell and Intel's Larrabee microprocessors, an on-chip ring topology is employed to link the cores. I see both architectures as scalable, through introducing multiple communication hierarchies in the shared cache model and multiple on-chip network segments to reduce contention. It is quite difficult to say which one will dominate, because right now we are in the middle of a battle of core interconnection standards and technologies. IBM is now investigating the possibility of optical processor interconnects to reduce power consumption and increase performance. NVIDIA, on the other hand, is investigating different types of cache and different topologies of cache hierarchy.

I think that's enough for this week :)

Next week I will quickly discuss some of my findings while working with the Cell Broadband Engine and GPGPUs.
 

Tuesday, September 22, 2009

Finally a PhD Holder!

Today I'm closing a very important chapter in my life. I defended my PhD dissertation and became part of the community. Before my presentation the only thoughts I had in mind were about my research and how I might answer the committee's questions in the best possible way. However, once they signed my PASS form I had totally different thoughts and contradicting feelings.

My presentation went very well. I was happy with it, and my advisors were happy too. What really touched me was the good turnout of other graduate students, mainly PhD students, interested in my research topic and wanting to build on my findings. I felt then the value of contributing to the scientific body and being part of the stack adding to human knowledge. I had some long discussions with my colleagues about future work and the possibility of working together sometime soon when we cross paths again. I felt sorry for my advisors because they had already heard the same ideas, problems, solutions, and results during our weekly research meetings :) However, it is fun when you talk about your three years of research. This is the point where you have all your confidence; you know that you are the one who knows this area best. That still does not immunize you from being asked questions that are very difficult to answer. The nature of being very focused in PhD research can make you easily get dragged into the details and forget about other critical areas or aspects of your research. This is where you can be hit during your defense.

Once I got the PASS form signed by my committee, I felt reluctant to deliver it to the graduate school, because it meant my official graduation from UCONN. Although I had a hard time adapting to campus life again at the beginning of my PhD, these years were the best in my life after my undergraduate studies. I learned a lot and met smart and interesting people. I worked in a creative environment with minds focused on finding creative solutions for critical problems and shaping the future of computing. Although I was working hard in my research to finish and graduate as soon as possible, on that day I didn't want to leave! It felt like moving from a small bowl to the huge deep blue ocean: the challenges are as plentiful as the opportunities, and as hard as the rewards are big.

However, I also had the feeling of getting my freedom back. Now I have many options in my career. Although these days it is very difficult to find a job, in the long run I can create more options for myself. I can resume working on my search engine algorithm and system. I can fund it with the few extra dollars I may have :) I can collaborate with more people in areas different from my PhD's main research topic.

I think I should be very careful and keep my advisors' voices and recommendations in front of me. This should keep me from slowing down and missing the opportunities to solve challenging problems. Thanks to all of them. I feel lucky to have had the opportunity to work with all of them; thanks professors Reda, Raj, Swapna, and Ian.

Friday, September 11, 2009