HPX 0.7.0 Released

We are very proud to announce the release of version 0.7.0 of our High Performance ParalleX (HPX) runtime system. This is our second formal release, and we would like to thank everyone involved for their hard work, which has made this release possible. You can download the release files from the downloads page, and the release notes are available from here. Please feel free to try the examples and let us know what you think. The best way to get in contact with us is to leave a comment on this page or to send an email to gopx@cct.lsu.edu.

We have made substantial progress since the 0.6.0 release last August: roughly 1000 commits and approximately 120 closed tickets (bugs, feature requests, etc.).

This post will expand on three of the most important advances that we have made since the last release.

Expansion of the Performance Counter Framework

The HPX performance counter framework provides an infrastructure for the intrusive instrumentation of HPX applications. Performance counters can collect data from hardware, the operating system, the HPX runtime, and HPX applications, and expose this data through a uniform interface. An application does not have to be aware of performance counters in order to be instrumented. However, any application can install its own Performance Counter instances, possibly exposing application-specific information. Performance Counters in a running application can be queried via the command line, or by connecting to the application over the parcel transport layer with a special monitoring program.

We added generic command line support for Performance Counters to any application that uses the function hpx::init to start the runtime system. Here is a summary of the new options (a short usage sketch follows the list):

--list-counters This option lists the names of all registered Performance Counters before starting the application. It is best combined with the option --exit, which makes the application list the Performance Counters and exit without executing any application code.
--list-counter-infos This option lists detailed information about all registered Performance Counters before starting the application. Again, it is best combined with --exit.
--print-counter This option specifies the name of a single Performance Counter. The value of that counter is printed either repeatedly at a fixed interval or once during application shutdown.
--print-counter-interval This option defines the time interval (in milliseconds) at which the Performance Counters specified with --print-counter are printed. The default is to print the values once during application shutdown.
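
As a minimal sketch of how these options come into play: any application that hands control to hpx::init automatically understands them. The skeleton below is illustrative only; the exact header names and the hpx::init overload may differ slightly in the 0.7.0 sources.

    // minimal HPX application (sketch)
    #include <hpx/hpx_init.hpp>

    int hpx_main(boost::program_options::variables_map& vm)
    {
        // ... application code runs on the HPX runtime here ...
        return hpx::finalize();   // shut the runtime down cleanly
    }

    int main(int argc, char* argv[])
    {
        // hpx::init parses the command line, including the Performance
        // Counter options listed above, and then invokes hpx_main
        return hpx::init(argc, argv);
    }

Typical invocations would then look like this (the counter name is a placeholder):

    ./minimal_hpx_app --list-counters --exit
    ./minimal_hpx_app --print-counter <counter-name> --print-counter-interval 100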

PBS Support

HPX applications are meant to be run on distributed resources, such as clusters. This implies that the executable application code is launched on every node. As the most widely used scheduling system for such resources is PBS (Portable Batch System), we added direct command line support for running HPX applications under a PBS batch scheduler.

Any HPX application now queries various environment variables set by PBS. In addition, you can specify the following command line options, which makes it easy to write PBS scripts (a sketch of such a script follows the list):

--nodes This option specifies the list of node names the application is supposed to run on.
--nodefile This option specifies the name of the PBS node file to use.
--ifsuffix, --ifprefix, --iftransform These options append or prepend the given argument to the node names the application runs on in order to select a specific network interconnect. The last option allows a more general (regular-expression based) transformation rule to be applied to all node names.
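
To make this concrete, a PBS batch script might look roughly like the sketch below. The resource request, the application name, and the -ib interface suffix are placeholders rather than recommendations; pbsdsh is shown as one possible way to start the executable on every allocated node.

    #!/bin/bash
    #PBS -l nodes=4:ppn=8             # illustrative resource request

    # PBS_NODEFILE is set by PBS and lists the nodes allocated to this job;
    # pbsdsh -u launches the executable once on each of those nodes
    pbsdsh -u $PBS_O_WORKDIR/my_hpx_app --nodefile=$PBS_NODEFILE --ifsuffix=-ib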

Native TLS support

A few weeks ago, we began benchmarking our hpx::lcos::eager_future<> LCO, and we were surprised by the amount of overhead that we observed. Our initial benchmark revealed an amortized overhead of 40 µs for creating, using and deleting one hpx::lcos::eager_future<>.

This overhead, which we deemed unacceptable, led us to investigate its major contributing factors. For some time prior to our first eager_future<> benchmark, we had suspected that Boost.Thread’s thread local storage (TLS) implementation (Boost’s thread_specific_ptr<>) might be a source of excessive contention. We use Boost’s thread_specific_ptr<> to store most of our important globals: the runtime pointer, the applier pointer, and the HPX-thread “self” pointer (i.e. the pointer to the HPX-thread that is currently executing).

This figure shows a performance comparison of the thread local storage in the Boost libraries with the native (compiler-based) implementation. The first version of this benchmark is in the 0.7.0 release, and the second version (which was used for this graph) is in the HPX SVN repository.

We predicted that native TLS support (i.e. operating-system TLS support exposed by certain compilers, namely GCC and MSVC) would be faster than Boost.Thread’s TLS, albeit less portable. We wrote a set of benchmarks to compare native TLS against Boost.Thread’s TLS. The results were shocking. Our benchmarks showed that Boost.Thread’s TLS not only performed worse than native TLS, it also scaled far worse. The contrast in performance likely comes from the differences in implementation. Boost.Thread uses operating-system synchronization primitives in its implementation, which adds contention when multiple threads begin accessing the same TLS variable. On x86-64, most native TLS implementations use a segment register to reach the meta-data needed to access TLS data, so no synchronization is required (for more information please see here and here).
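
The difference is easiest to see in the declarations themselves. The fragment below contrasts the two approaches; it is illustrative only, and HPX’s actual TLS globals are of course not plain ints.

    #include <boost/thread/tss.hpp>

    // Boost.Thread TLS: every get()/reset() goes through a per-thread lookup
    // that relies on operating-system primitives, which becomes a point of
    // contention when many threads touch the same variable
    boost::thread_specific_ptr<int> boost_tls;

    // Native (compiler-based) TLS: GCC's __thread (or __declspec(thread) on
    // MSVC) lets the compiler address the variable directly through the
    // thread's TLS segment, so no synchronization is needed on access
    #if defined(_MSC_VER)
    __declspec(thread) int* native_tls = 0;
    #else
    __thread int* native_tls = 0;
    #endif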

To ease HPX’s transition to native TLS, we wrote a class compatible with Boost’s thread_specific_ptr<> (named hpx::util::thread_specific_ptr<>) which makes use of native TLS support in x86-64 Linux GCC and MSVC. Switching to this new class throughout HPX drastically reduced the contention and improved overall system performance. When we re-ran our eager_future<> tests, we found that the aforementioned overhead dropped to 17 µs. This number is still too high and requires improvement; however, we were impressed by the effect of this single change on our runtime performance.
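
Because the new class mirrors the interface of Boost’s thread_specific_ptr<>, switching is essentially a type substitution. The fragment below is only a sketch of that idea; the header path is an assumption, not taken from the 0.7.0 sources.

    // hypothetical include path -- the class is described above as a
    // drop-in replacement for boost::thread_specific_ptr<>
    #include <hpx/util/thread_specific_ptr.hpp>

    // before: boost::thread_specific_ptr<int> counter;
    hpx::util::thread_specific_ptr<int> counter;

    void bump()
    {
        if (!counter.get())            // same get()/reset() semantics as Boost
            counter.reset(new int(0));
        ++*counter;
    }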


All in all, we are very pleased with this release, as we were able to stabilize our API, improve overall performance, fix quite a number of bugs, and clean up a lot of things in our code base. The next release is scheduled for March next year. We plan to continue working on all of these areas; however, there will be much more work to be done in the area of documentation.
