Benchmarking User Level Threads

One of the core features of HPX is its lightweight user-level threading. User-level threading implements a second layer of thread infrastructure on top of OS-threads (i.e. the threads provided by the operating system or kernel). This form of threading is also called hybrid or M:N (mapping N user threads onto M OS-threads) threading.

We recently conducted a benchmark of the scalability of lightweight user-level threads in the face of extremely fine-grained parallelism. Fine-grained parallelism refers to the division of work into very small parallel tasks. By making the tasks very small, the task scheduler is able to load balance more efficiently in the face of highly dynamic applications.

This article presents details of the benchmark we used, and a comparison of HPX with three other software libraries which provide lightweight user-level threading (Qthreads, TBB and SWARM).

The Benchmark

We used a benchmark which we call the Homogeneous Task Spawn benchmark. The benchmark is rather simple: a serial for loop spawns T tasks, with each task doing a fixed workload W that involves no synchronization or communication. We implement the workload as follows for all four libraries that we tested:

double volatile d = 0.;
for (uint64_t i = 0; i < delay; ++i)
    d += 1 / (2. * i + 1);

To determine the value of W in walltime, we run a baseline code that executes the above loop N times serially and uses a high-precision timer to measure the total time B. We then compute W as W = B / N.
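The calibration step can be sketched roughly as follows. This is not the baseline code from the HPX release, only an illustration of the W = B / N computation. The function names `worker` and `measure_workload_seconds` are hypothetical, and the use of `std::chrono::steady_clock` as the high-precision timer is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// The workload loop from the article, wrapped in a function (hypothetical name).
// `delay` is the number of loop iterations that one task performs.
inline double worker(std::uint64_t delay)
{
    double volatile d = 0.;
    for (std::uint64_t i = 0; i < delay; ++i)
        d = d + 1 / (2. * i + 1);   // same arithmetic as `d += ...` in the article
    return d;
}

// Hypothetical helper: run the workload n times serially, measure the total
// time B, and return W = B / N in seconds.
inline double measure_workload_seconds(std::uint64_t delay, std::uint64_t n)
{
    assert(n > 0);
    using clock = std::chrono::steady_clock;   // assumed timer, monotonic
    auto const start = clock::now();
    for (std::uint64_t k = 0; k < n; ++k)
        worker(delay);
    std::chrono::duration<double> const b = clock::now() - start;   // B
    return b.count() / static_cast<double>(n);                      // W = B / N
}
```

In this sketch, one would adjust `delay` until `measure_workload_seconds(delay, n)` returns the desired W (for example 100µs) on the test machine.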

The source code for all benchmarks can be found in the HPX 0.8.0 release, in the tests/performance directory. Alternatively, you can access the individual tests directly here: HPX, Qthreads, TBB, SWARM.
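For readers who want a feel for the structure without opening the sources, here is a minimal sketch of the serial spawn loop described above. It is not taken from tests/performance. It uses `std::async` from the C++ standard library as a stand-in for a task spawn, and the names `fixed_workload` and `spawn_and_time` are hypothetical. Note that `std::launch::async` typically creates one OS-thread per task, which is exactly the cost that user-level threading libraries aim to avoid, so this sketch only shows the shape of the benchmark and is not a meaningful performance comparison.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <vector>

// The fixed workload W from the article: no synchronization, no communication.
inline double fixed_workload(std::uint64_t delay)
{
    double volatile d = 0.;
    for (std::uint64_t i = 0; i < delay; ++i)
        d = d + 1 / (2. * i + 1);
    return d;
}

// Hypothetical driver: a serial for loop spawns `tasks` tasks, waits for all
// of them, and returns the elapsed walltime in seconds.
inline double spawn_and_time(std::uint64_t tasks, std::uint64_t delay)
{
    assert(tasks > 0);
    std::vector<std::future<double>> futures;
    futures.reserve(tasks);

    auto const start = std::chrono::steady_clock::now();

    // Serial spawn loop: every task is created by this single thread.
    for (std::uint64_t t = 0; t < tasks; ++t)
        futures.push_back(std::async(std::launch::async, fixed_workload, delay));

    // Wait for all tasks to finish before stopping the timer.
    for (auto& f : futures)
        f.get();

    std::chrono::duration<double> const elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Keep the task count small when trying this sketch (for example `spawn_and_time(64, 1000)`), since each task here may correspond to a full OS-thread.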

The Results

Here are the results of the benchmarks. We ran them on an HP DL785 G6 node with 48 cores (8 sockets, AMD Opterons) and 96 GB of RAM (DDR2 553MHz). The test machine was running Debian Linux (kernel version 3.1). The benchmarks were run on February 7th, 2012.

The four libraries benchmarked were HPX, Qthreads, TBB and SWARM.

As you can see from the results, HPX and TBB are closely tied for best performance when there is no artificial workload (0µs). This workload is the ultimate test of fine-grained parallelism. While we were pleased with our results here, we’d like to improve our performance at this level.

For 100µs and 1000µs workloads, HPX shows excellent, stable scaling curves. Even after HPX reaches the point of saturation (the point at which using more parallel processing units adds more overhead than speedup), HPX’s degradation is very slow and stable. The other libraries degrade much more quickly.

It is also interesting to note the consistency of the data from HPX and TBB, in contrast to the data from SWARM and Qthreads. We ran multiple trials of each data point and averaged the results, so we do not believe that the discrepancies are merely measurement noise.

We hope to learn more from the results of this benchmark in the coming months. Analysis of this benchmark should allow us to improve HPX’s threading system to enable new levels of fine-grained parallelism.

    12 thoughts on “Benchmarking User Level Threads”

    1. Hi Bryce,

      Can you comment on the versions of the different codes that you tested against? Also, can you comment on how “number of cores” is expressed in the ratio of number of OS threads to the number of work queues – i.e., is there one work queue per OS thread, one work queue shared by all OS threads, or some other mapping – for the Qthreads runs in particular?

      ,Dylan

      • Dylan,

        We ran with QT_NUM_SHEPHERDS = 1, 2, 3, 4, 5, 6, 7, 8 (there were 8 sockets on the machine) and QT_NUM_WORKERS_PER_SHEPHERD = 6 (6 cores per socket). For the 1 – 5 OS-threads case, we ran with 1 shepherd and 1 – 5 workers.

        You can find the source code here

            • I was asking about the versions of the different codebases. Your write-up only specifies that HPX 0.8.0 was used. It does not indicate which versions of TBB, SWARM, or Qthreads you used to generate these numbers. Can you comment on the versions of the different codes that you tested against?

            • Sure!

              Qthreads – SVN r2881
              TBB – Debian 4.0+r233-1
              SWARM – 0.6.1

    2. Any reason you chose an SVN revision of Qthreads that was almost two months older than the release of HPX?

      Perhaps you could try the most recent 1.7.1 release of Qthreads, or at least the head of SVN to get a more accurate comparison.

      • Dylan,

        Can I infer from your comments that you have been able to improve Qthreads to perform better on those benchmarks? That would be awesome. In that case you could have simply asked us to redo the benchmarks with your recently released version and we would have tried to do that. Congrats on your release, btw!

        Regards Hartmut

    3. Thanks for the details. I wanted more information about the parameters of this experiment so that I could draw more appropriate conclusions.

      To your question, there was a performance bug with serial spawn-loops that was fixed, as luck would have it, two commits (and six hours) after the revision you tested (r2883). The 1.7.1 release, in late February, should have very different performance characteristics on that specific benchmark – when I saw your graph, I assumed you had used the most recent release, and was worried that the bug had re-emerged somehow. I’m glad to know it was just an unfortunate SVN revision.

      • Is there a plan to compare the most recent versions of Qthreads and HPX? I would be very interested in seeing the results. I use Qthreads for my current projects and would like to see a fair comparison before deciding how to proceed with a project I am just beginning. Thank you.
