Herb Sutter started a new blog series expanding on his idea of the ‘Free Lunch is Over’. If you have not read his articles on this topic here, I would suggest to do so. He thoroughly analyses trends in the development of our hardware and software systems and tries to predict where we’re going. In the end, I could not agree more with his final assessment:
To continue enjoying the free lunch of shipping an application that runs well on today’s hardware and will just naturally run faster or better on tomorrow’s hardware, you need to write an app with lots of juicy latent parallelism expressed in a form that can be spread across a machine with a variable number of cores of different kinds – local and distributed cores, and big/small/specialized cores.
What I am missing from his analysis is that he does not suggest how this should be done. Neither does he mention what should be the general objectives and criteria while designing such systems. To answer these questions we need to go back and do an analysis of why is it so difficult to build a highly parallel and still efficient system – and yes, I mean the whole system: hardware, operating system, runtime systems, compilers, and application software. What are the main fundamental impeding factors making our systems slow? At this point I hear you shouting ‘Writing software is hard and writing multithreaded software is incredibly hard!’. Right, I agree. However that’s not a fundamental system flaw, merely an artifact of our limited human abilities. Better tools and languages will make our live easier.
What Makes our Systems Slow?
Estimates say that we currently run our computers at way below 100% efficiency. The theoretical peak performance (usually measured in FLOPS – floating point operations per second) is much higher than any practical peak performance reached by any application. This is particularly true for highly parallel hardware. The more hardware parallelism we provide to an application, the better the application must scale in order to efficiently use all the resources of the machine. Roughly speaking, we distinguish two forms of scalability: strong scaling (see also Amdahl’s Law) and weak scaling (see also Gustafson’s Law). Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size. It gives an estimate of how much faster can we solve a particular problem by throwing more resources at it. Weak scaling is defined as how the solution time varies with the number of processors for a fixed problem size per processor. In other words, it defines how much more data can we process by using more hardware resources.
In order to utilize as much hardware parallelism as possible an application must exhibit excellent strong and weak scaling characteristics, which requires a high percentage of work executed in parallel, i.e. using multiple threads of execution. Optimally, if you execute an application on a hardware resource with N processors it either runs N times faster or it can handle N times more data. Both cases imply 100% of the work is executed on all available processors in parallel. However, this is just a theoretical limit (I am not going to investigate super-scaling here). Unfortunately, there are more things which limit scalability, mostly inherent to the hardware architectures and the programming models we use: there are four fundamental factors which make our systems SLOW:
- Starvation means to have insufficient concurrent work available to maintain high utilization of all resources.
- Latencies are imposed by the time-distance delay intrinsic to accessing remote resources and services
- Overhead is work required for the management of parallel actions and resources on the critical execution path which is not necessary in a sequential variant
- Waiting for Contention resolution is caused by delays due to the lack of availability of oversubscribed shared resources
Each of those four factors manifests itself in multiple and different ways; each of the hardware architectures and programming models expose specific forms. However the interesting part is that all of them are limiting the scalability of applications no matter what part of the hardware jungle we look at – hand-helds, PCs, supercomputers, or the cloud, all suffer from the reign of the 4 horsemen: Starvation, Latency, Overhead, and Contention. This realization is very important as it allows us to derive the criteria for solutions to the scalability problem from first principles, it allows us to focus our analysis on very concrete patterns and measurable metrics. Moreover, any derived results will be applicable to a wide variety of targets.
Technology Demands new Response
Today’s computer systems are designed based on the initial ideas of John von Neumann, as published back in 1945, and later extended by the Harvard architecture. These ideas form the foundation, the execution model of computer systems we use currently. But apparently a new response is required in the light of the demands created by today’s technology (as thoroughly analyzed in Herb Sutter’s article).
So let’s get back to the questions posed above. What are the overarching objectives for designing systems fulfilling the mentioned goals? While the answer I can give is purely my own opinion, I believe not to be off too far. The main objectives for me are:
- Performance: as mentioned, scalable and efficiency are the main criteria people are interested in
- Fault tolerance: the low expected mean time between failures (MTBF) of future systems requires to embrace faults, not trying to avoid them
- Power: minimizing energy consumption is a must as it is one of the major cost factors today, even more so in the future
- Generality: any system should be usable for a broad set of use cases
- Programmability: for me as a programmer this is a very important objective, ensuring long term platform stability and portability
So, what needs to be done to meet those objectives, to make applications scale better on tomorrow’s architectures? Well, the answer is almost obvious: we need to devise a new execution model – a set of governing principles for the holistic design of future systems – targeted at minimizing the effect of the outlined SLOW factors. Everything we create for future systems, every design decision we make, every criteria we apply, has to be validated against this single, uniform metric. This includes changes in the hardware architecture we prevalently use today, and it certainly involves new ways of writing software, starting from the operating system, runtime system, compilers, and at the application level. However the key point is that all those layers have to be co-designed, they are interdependent and cannot be seen as separate facets. The systems we have today have been evolving for over 50 years now. All layers function in a certain way relying on the other layers to do so as well. However, we don’t have the time to wait for a coherent system to evolve for another 50 years. The new paradigms are needed now – therefore, co-design is the key.
As it turn out,we do not have to start from scratch. Not everything has to be invented and designed anew. Many of the ideas needed to combat the 4 horsemen have already been had, often more than 30 years ago. All it takes is to gather them into a coherent approach. So please let me highlight some of the derived principles we think to be crucial for defeating SLOW. Some of those are focused on high-performance computing, others are more general.
Focus on Latency Hiding instead of Latency Avoidance
It is impossible to design a system exposing zero latencies. In an effort to come as close as possible to this goal many optimizations are mainly targeted towards minimizing latencies. Examples for this can be seen everywhere, for instance low latency network technologies like InfiniBand, caching memory hierarchies in all modern processors, the constant optimization of existing MPI implementations to reduce related latencies, or the data transfer latencies intrinsic to the way we use GPGPUs today. It is important to note, that existing latencies are often tightly related to some resource having to wait for the operation to be completed. At the same time it would be perfectly fine to do some other, unrelated work in the meantime, allowing to hide the latencies by filling the idle-time with useful work. Modern system already employ similar techniques (pipelined instruction execution in the processor cores, asynchronous input/output operations, and many more). What we propose is to go beyond anything we know today and to make latency hiding an intrinsic concept of the operation of the whole system stack.
Embrace Fine-grained Parallelism instead of Heavyweight Threads
If we plan to hide latencies even for very short operations, such as fetching the contents of a memory cell from main memory (if it is not already cached), we need to have very lightweight threads with extremely short context switching times, optimally executable within one cycle. Granted, for mainstream architectures this is not possible today (even if we already have special machines supporting this mode of operation, such as the Cray XMT). For conventional systems however, the smaller the overhead of a context switch and the finer the granularity of the threading system, the better will be the overall system utilization and its efficiency. For today’s architectures we already see a flurry of libraries providing exactly this type of functionality: non-preemptive, task-queue based parallelization solutions, such as Intel’s TBB, Microsoft’s PPL, Cilk++, and many others. The possibility to suspend a current task if some preconditions for its execution are not met (such as waiting for I/O or the result of a different task), seamlessly switching to any other task which can continue, and to reschedule the initial task after the required result has been calculated, which makes the implementation of latency hiding almost trivial.
Rediscover Constrained Based Synchronization to replace Global Barriers
The code we write today is riddled with implicit (and explicit) global barriers. When I say global barrier I mean the synchronization of the control flow between several (very often all) threads (when using OpenMP) or processes (MPI). For instance, an implicit global barrier is inserted after each loop parallelized using OpenMP as the system synchronizes the threads used to execute the different iterations in parallel. In MPI each of the communication steps imposes an explicit barrier onto the execution flow as (often all) nodes have be synchronized. Each of those barriers acts as an eye of the needle the overall execution is forced to be squeezed through. Even minimal fluctuations in the execution times of the parallel threads (jobs) causes them to wait. Additionally it is often only one of the threads executing doing the actual reduce operation, which further impedes parallelism. A closer analysis of a couple of key algorithms used in science applications reveals that these global barriers are not always necessary. In many cases it is sufficient to synchronize a small subset of the threads. Any operation should proceed whenever the preconditions for its execution are met, and only those. Usually there is no need to wait for iterations of a loop to finish before you could continue calculating other things, all you need is to have those iterations done which were producing the required results for a particular next operation. Good bye global barriers, hello constraint based synchronization! People have been trying to build this type of computing (and even computers) already back in the 1970’s. The theory behind what they did is based on ideas around static and dynamic dataflow. There are certain attempts today to get back to those ideas and to incorporate them with modern architectures. For instance, a lot of work is being done in the area of constructing dataflow oriented execution trees. Our results show that employing dataflow techniques in combination with the other ideas as outlined here allows to improve overall scalability for certain problems considerably .
Adaptive Locality Control instead of Static Data Distribution
While this principle seems to be a given for single desktop or laptop computers (the operating system is your friend), it is everything but ubiquitous on modern supercomputers, which are usually built from a large number of separate nodes (i.e. Beowulf clusters), tightly interconnected by a high bandwidth, low latency network. Today’s prevalent programming model for those is MPI (Message Passing Interface) which does not directly help with proper data distribution, leaving it to the programmer to decompose the data to all of the nodes the application is running on. There are a couple of specialized languages and programming environments based on PGAS (Partitioned Global Address Space) designed to overcome this limitation, such as Chapel, X10, UPC, or Fortress. However all systems based on PGAS rely on static data distribution. This works fine as long as such a static data distribution does not result in inhomogeneous workload distributions or other resource utilization imbalances. In a distributed system these imbalances can be mitigated by migrating part of the application data to different localities (nodes). The only framework supporting (limited) migration today is Charm++. The first attempts towards solving related problem go back decades as well, a good example is the Linda coordination language. Nevertheless, none of the other mentioned systems support data migration today, which forces the users to either rely on static data distribution and live with the related performance hits or to implement everything themselves, which is very tedious and difficult. We believe that the only viable way to flexibly support dynamic and adaptive locality control is to provide a global, uniform address space to the applications, even on distributed systems.
Prefer Moving Work to the Data over Moving Data to the Work
For best performance it seems obvious to minimize the amount of bytes transferred from one part of the system to another. This is true on all levels. At the lowest level we try to take advantage of processor memory caches, thus minimizing memory latencies. Similarly, we try to amortize the data transfer time to and from GPGPUs as much as possible. At high levels we try to minimize data transfer between different nodes of a cluster or between different virtual machines on the cloud. Our experience (well, it’s almost common wisdom) shows that the amount of bytes necessary to encode a certain operation is very often much smaller than the amount of bytes encoding the data the operation is performed upon. Nevertheless we still often transfer the data to a particular place where we execute the operation just to bring the data back to where it came from afterwards. As an example let me look at the way we usually write our applications for clusters using MPI. This programming model is all about data transfer between nodes. MPI is the prevalent programming model for clusters, it is fairly straightforward to understand and to use. Therefore, we often write the applications in a way accommodating this model, centered around data transfer. These applications usually work well for smaller problem sizes and for regular data structures. The larger the amount of data we have to churn and the more irregular the problem domain becomes, the worse are the overall machine utilization and the (strong) scaling characteristics. While it’s not impossible to implement more dynamic, data driven, and asynchronous applications using MPI, it is overly difficult to so. At the same time, if we look at applications preferring to execute the code close the locality where the data was placed, i.e. utilizing active messages (for instance based on Charm++), we see better asynchrony, simpler application codes, and improved scaling.
Favor Message Driven Computation over Message Passing
Today’s prevalently used programming model on parallel (multi-node) systems is MPI. It is based on message passing (as the name implies), which means that the receiver has to be aware of a message about to come in. Both codes, the sender and the receiver, have to synchronize in order to perform the communication step. Even the newer, asynchronous interfaces require to explicitly code the algorithms around the required communication scheme. As a result, any more than trivial MPI application spends a considerable amount of time waiting for incoming messages, thus causing starvation and latencies to impede full resource utilization. The more complex and more dynamic the data structures and algorithms become, the larger are the adverse effects. The community has discovered message-driven and (data-driven) methods of implementing algorithms a long time ago, and systems such as Charm++ already have integrated active messages demonstrating the validity of the concept. message driven computation allows to send messages without that the receiver has to actively wait for them. Any incoming message is handled asynchronously and triggers the encoded action by passing along arguments and – possibly – continuations. We suggest to combine this scheme with work queue based scheduling as described above, which allows to almost completely overlap any communication with useful work, reducing latencies to a minimum.
By no means I would have been able to write this article all on my own. Many of the ideas presented here have been developed by Thomas Sterling or are the results of very fruitful discussions inside and outside of our group. Moreover, I believe, that a single group of people is not sufficient to resolve all the problems, we need many more ideas and discussions to reach the goal of making our systems FAST: Fault tolerant, Asynchronous, Scalable, and Technologically co-designed.
Overall I believe the free lunch is not over yet (and I seem to agree with Herb Sutter), even if we might have to think harder how to take advantage of it. There is plenty of room for us to improve the scalability of our applications if we do it right, many ideas are already out there, all it takes is to gather them into a coherent holistic approach, almost guaranteeing a stable and reliable strategy towards future architectures and applications.