One of the core features of HPX is our lightweight user-level threading. User-level threading implements a second layer of thread infrastructure on top of OS-threads (i.e. the thread implementations provided by the operating system or kernel). This form of threading is also called hybrid or M:N threading (mapping M user threads onto N OS-threads).
We recently benchmarked the scalability of lightweight user-level threads under extremely fine-grained parallelism. Fine-grained parallelism refers to the division of work into very small parallel tasks. When tasks are very small, the task scheduler can load balance more effectively in highly dynamic applications.
We used a benchmark which we call the Homogeneous Task Spawn benchmark. The benchmark is rather simple: a serial for loop spawns T tasks, with each task doing a fixed workload W that involves no synchronization or communication. We implement the workload as follows for all four libraries that we tested:
double volatile d = 0.;
for (uint64_t i = 0; i < delay; ++i)
    d += 1 / (2. * i + 1);
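To give a sense of the benchmark's overall shape, here is a minimal sketch of its structure written against the current HPX API (hpx::async and hpx::wait_all). The actual 0.8.0 benchmark in tests/performance differs in its details, and the task and delay counts below are purely illustrative:

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

#include <cstdint>
#include <vector>

// The per-task workload shown above, wrapped in a function.
void worker(std::uint64_t delay)
{
    double volatile d = 0.;
    for (std::uint64_t i = 0; i < delay; ++i)
        d += 1 / (2. * i + 1);
}

int main()
{
    std::uint64_t const tasks = 500000; // T, the number of tasks (illustrative)
    std::uint64_t const delay = 1000;   // controls the per-task workload W

    // A serial loop spawns T independent lightweight HPX threads...
    std::vector<hpx::future<void>> futures;
    futures.reserve(tasks);
    for (std::uint64_t t = 0; t != tasks; ++t)
        futures.push_back(hpx::async(&worker, delay));

    // ...and the benchmark then waits for all of them to finish.
    hpx::wait_all(futures);
    return 0;
}

The quantity of interest is how long the spawn-and-wait phase takes as T grows and W shrinks, since that isolates the overhead of task creation and scheduling.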
To determine the value of W in walltime, we run a baseline code that uses a high-precision timer to measure the time B taken by N serial executions of the above workload. W is then simply W = B / N.
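In plain C++, such a baseline might look like the following sketch (the use of std::chrono and the constants here are our illustration, not the exact code from the release):

#include <chrono>
#include <cstdint>
#include <cstdio>

int main()
{
    std::uint64_t const N = 100000;   // number of serial repetitions (illustrative)
    std::uint64_t const delay = 1000; // the same delay the tasks will use

    // Time N back-to-back serial executions of the workload...
    auto const start = std::chrono::high_resolution_clock::now();
    for (std::uint64_t n = 0; n != N; ++n)
    {
        double volatile d = 0.;
        for (std::uint64_t i = 0; i < delay; ++i)
            d += 1 / (2. * i + 1);
    }
    auto const stop = std::chrono::high_resolution_clock::now();

    // ...and divide to get the per-execution walltime W = B / N.
    double const B = std::chrono::duration<double>(stop - start).count();
    std::printf("W = %g seconds\n", B / N);
    return 0;
}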
The source code for all benchmarks can be found in the HPX 0.8.0 release, in the tests/performance directory. Alternatively, you can access the individual tests directly here: HPX, Qthreads, TBB, SWARM.
Here are the results of the benchmarks. We ran them on an HP DL785 G6 node with 48 cores (8 sockets of AMD Opterons) and 96 GB of RAM (DDR2, 533 MHz). The test machine was running Debian Linux (kernel version 3.1). The benchmarks were run on February 7th, 2012.
The four libraries benchmarked were:

- HPX (0.8.0)
- Qthreads
- Intel Threading Building Blocks (TBB)
- SWARM
As you can see from the results, HPX and TBB are closely tied for best performance when there is no artificial workload (0µs). This case is the ultimate test of fine-grained parallelism, as it measures almost nothing but the overhead of spawning and scheduling tasks. While we were pleased with our results here, we'd like to improve our performance at this level.
For 100µs and 1000µs workloads, HPX shows excellent, stable scaling curves. Even after HPX reaches saturation (the point at which adding more parallel processing units costs more in overhead than it gains in speedup), its performance degrades slowly and predictably; the other libraries degrade much more quickly.
It is also interesting to note the consistency of the data from HPX and TBB, in contrast to the data from SWARM and Qthreads. We did multiple trials of each data point and averaged the results, so we do not believe the discrepancies can be dismissed as measurement noise.
We hope to learn more from these results in the coming months. Further analysis of this benchmark should allow us to improve HPX's threading system and enable new levels of fine-grained parallelism.