HPX - High Performance ParalleX

More Details about HPX Command Line Options

This section documents the following command line options in more detail:

The Command Line Option --hpx:bind

This command line option specifies the required affinity of the HPX worker threads to the underlying processing units. As a result, the worker threads will run only on the processing units identified by the corresponding bind specification. The affinity settings are specified using --hpx:bind=<BINDINGS>, where <BINDINGS> has to be formatted as described below.

Note

This command line option is only available if HPX was built with support for Portable Hardware Locality (HWLOC) enabled. Please see Options and Variables for more details on how to enable support for HWLOC in HPX.

The specified affinities refer to specific regions within a machine's hardware topology. To understand the hardware topology of a particular machine, it may be useful to run the lstopo tool, which is part of Portable Hardware Locality (HWLOC), to see the reported topology tree. Seeing and understanding a topology tree will help in understanding the concepts discussed below.

Affinities can be specified using HWLOC tuples. Tuples of HWLOC objects and associated indexes can be specified in the form object:index, object:index-index, or object:index,...,index. HWLOC objects represent types of mapped items in a topology tree. Possible values for objects are socket, numanode, core, and pu (processing unit). Indexes are non-negative integers that specify a unique physical object in a topology tree using its logical sequence number.

Chaining multiple tuples together in the more general form object1:index1[.object2:index2[...]] is permissible. While the first tuple's object may appear anywhere in the topology, the Nth tuple's object must have a shallower topology depth than the (N+1)th tuple's object. Put simply: as you move right in a tuple chain, objects must go deeper in the topology tree. Indexes specified in chained tuples are relative to the scope of the parent object. For example, socket:0.core:1 refers to the second core in the first socket (all indices are zero based).
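The chaining rule can be illustrated with a small Python sketch. This is not HPX's own parser; it only mirrors the prose above. The function names are hypothetical, and the depth order socket, numanode, core, pu is an assumption made for the example (on a real machine the relative depth of socket and numanode depends on the topology that HWLOC reports).

```python
# Assumed depth order, shallowest first; real topologies may differ.
DEPTH = {"socket": 0, "numanode": 1, "core": 2, "pu": 3}


def parse_indexes(text):
    """Expand '1', '0-3', or '0,2,5' into a list of integers."""
    result = []
    for part in text.split(","):
        if "-" in part:
            first, last = part.split("-")
            first, last = int(first), int(last)
            if last < first:
                raise ValueError(f"descending range: {part!r}")
            result.extend(range(first, last + 1))
        else:
            result.append(int(part))
    return result


def parse_chain(chain):
    """Parse 'socket:0.core:1' into [('socket', [0]), ('core', [1])]."""
    tuples = []
    for item in chain.split("."):
        obj, sep, indexes = item.partition(":")
        if not sep or obj not in DEPTH:
            raise ValueError(f"bad tuple: {item!r}")
        # Each object must lie deeper in the tree than the one before it.
        if tuples and DEPTH[obj] <= DEPTH[tuples[-1][0]]:
            raise ValueError(f"{obj!r} is not deeper than {tuples[-1][0]!r}")
        tuples.append((obj, parse_indexes(indexes)))
    return tuples


print(parse_chain("socket:0.core:1"))  # [('socket', [0]), ('core', [1])]
```

A chain such as core:0.socket:0 is rejected by this sketch because socket is not deeper than core.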

Multiple affinities can be specified using several --hpx:bind command line options or by appending several affinities separated by a semicolon (';'). By default, if multiple affinities are specified, they are added together.

"all" is a special affinity consisting of the entire current topology.

Note

All 'names' in an affinity specification, such as thread, socket, numanode, pu, or all, can be abbreviated. Thus the affinity specification threads:0-3=socket:0.core:1.pu:1 is fully equivalent to its shortened form t:0-3=s:0.c:1.p:1.
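As an illustration of how such abbreviations could be resolved, the following Python sketch expands every abbreviated name in a specification to its full form. The helper names are hypothetical. The rule that any unambiguous prefix is accepted is an assumption, since the text only shows the single-letter forms.

```python
import re

# Full names that may appear in an affinity specification.
NAMES = ("thread", "socket", "numanode", "core", "pu", "all")


def expand_name(abbrev):
    """Return the single full name that starts with `abbrev`."""
    matches = [name for name in NAMES if name.startswith(abbrev)]
    if not abbrev or len(matches) != 1:
        raise ValueError(f"unknown or ambiguous name: {abbrev!r}")
    return matches[0]


def expand_spec(spec):
    """Replace each alphabetic word in `spec` with its full name."""
    return re.sub(r"[a-z]+", lambda m: expand_name(m.group()), spec)


print(expand_spec("s:0.c:1.p:1"))  # socket:0.core:1.pu:1
```

Because the six names start with six different letters, every single-letter abbreviation resolves to exactly one name under this assumption.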

Here is a full grammar describing the possible format of mappings:

    'socket' | 'numanode'

The following example assumes a system with at least 4 cores, where each core has more than 1 processing unit (hardware thread). Running hello_world with 4 OS threads (on 4 processing units), where each of those threads is bound to the first processing unit of a different core, can be achieved by invoking:

hello_world -t4 --hpx:bind=thread:0-3=core:0-3.pu:0

Here thread:0-3 specifies the OS threads for which to define affinity bindings, and core:0-3.pu:0 defines that for each of the cores (core:0-3) only their first processing unit (pu:0) should be used.
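To make the mapping concrete, the following Python sketch expands a specification of exactly this shape into a per-thread table. It is an illustration only, not HPX's implementation: the function names are hypothetical, and the sketch handles only the form thread:A-B=core:C-D.pu:N with ranges of equal length, pairing the i-th thread with the i-th core.

```python
def expand_range(text):
    """'0-3' -> [0, 1, 2, 3]; '2' -> [2]."""
    first, _, last = text.partition("-")
    return list(range(int(first), int(last or first) + 1))


def bindings(mapping):
    """Map each thread to a (core, pu) pair for 'thread:A-B=core:C-D.pu:N'."""
    thread_spec, pu_spec = mapping.split("=")
    thread_name, thread_range = thread_spec.split(":")
    core_spec, pu_part = pu_spec.split(".")
    core_name, core_range = core_spec.split(":")
    pu_name, pu_index = pu_part.split(":")
    if (thread_name, core_name, pu_name) != ("thread", "core", "pu"):
        raise ValueError("this sketch only handles thread:..=core:...pu:N")
    threads = expand_range(thread_range)
    cores = expand_range(core_range)
    if len(threads) != len(cores):
        raise ValueError("this sketch only pairs ranges of equal length")
    # Thread i runs on the given PU of core i.
    return {t: (c, int(pu_index)) for t, c in zip(threads, cores)}


for thread, (core, pu) in bindings("thread:0-3=core:0-3.pu:0").items():
    print(f"thread {thread} -> core {core}, pu {pu}")
```

For the example above, each of the four threads ends up on processing unit 0 of the core with the same index.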

Note

The command line option --hpx:print-bind can be used to print the bitmasks generated from the affinity mappings as specified with --hpx:bind. For instance, on a system with hyperthreading enabled (i.e. 2 processing units per core), the command line:

hello_world -t4 --hpx:bind=thread:0-3=core:0-3.pu:0 --hpx:print-bind

will cause this output to be printed:

0: PU L#0(P#0), Core L#0, Socket L#0, Node L#0(P#0)
1: PU L#2(P#2), Core L#1, Socket L#0, Node L#0(P#0)
2: PU L#4(P#4), Core L#2, Socket L#0, Node L#0(P#0)
3: PU L#6(P#6), Core L#3, Socket L#0, Node L#0(P#0)

where each line shows the processing unit (and the core, socket, and NUMA node containing it) that the listed worker thread will be bound to run on.
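When checking bindings in a script, output of this kind can be turned into structured data. The following Python sketch assumes the output has exactly the line format shown above. The regular expression and function name are hypothetical, and other HPX versions may print a different format.

```python
import re

# One line per worker thread, in the format of the sample output above.
LINE = re.compile(
    r"(?P<thread>\d+): PU L#(?P<pu>\d+)\(P#\d+\), Core L#(?P<core>\d+), "
    r"Socket L#(?P<socket>\d+), Node L#(?P<node>\d+)\(P#\d+\)"
)

SAMPLE = """\
0: PU L#0(P#0), Core L#0, Socket L#0, Node L#0(P#0)
1: PU L#2(P#2), Core L#1, Socket L#0, Node L#0(P#0)
2: PU L#4(P#4), Core L#2, Socket L#0, Node L#0(P#0)
3: PU L#6(P#6), Core L#3, Socket L#0, Node L#0(P#0)
"""


def parse_print_bind(text):
    """Return one dict of logical indexes per recognised output line."""
    rows = []
    for line in text.splitlines():
        match = LINE.fullmatch(line.strip())
        if match:
            rows.append({key: int(value) for key, value in match.groupdict().items()})
    return rows


rows = parse_print_bind(SAMPLE)
# Every worker thread should sit on its own core.
print(len({row["core"] for row in rows}) == len(rows))  # True
```

With 2 processing units per core, the logical PU indexes 0, 2, 4, and 6 are the first processing units of cores 0 to 3, which matches the requested binding.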

The difference between the three predefined distribution schemes (compact, scatter, and balanced) is best explained with an example. Imagine a system with 4 cores and 4 hardware threads per core. If we place 8 threads, the assignments produced by the compact, scatter, and balanced types are as shown in the figure below. Notice that compact does not fully utilize all the cores in the system. For this reason, it is recommended that applications are run using the scatter or balanced options in most cases.
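The three schemes can also be sketched in a few lines of Python that compute, for each thread, the core it is placed on. This is an illustration under assumptions, not HPX's implementation. The function names are hypothetical. The compact rule follows from the text (it leaves cores unused), while the round-robin rule for scatter and the neighbour-grouping rule for balanced are assumed from the usual meaning of these names. The sketch also assumes there are no more threads than hardware threads.

```python
def compact(num_threads, cores, pus_per_core):
    """Fill all hardware threads of core 0 first, then core 1, and so on."""
    return [t // pus_per_core for t in range(num_threads)]


def scatter(num_threads, cores, pus_per_core):
    """Deal threads out to the cores round-robin."""
    return [t % cores for t in range(num_threads)]


def balanced(num_threads, cores, pus_per_core):
    """Spread the load evenly, keeping consecutive threads on the same core."""
    base, extra = divmod(num_threads, cores)
    placement = []
    for core in range(cores):
        # The first `extra` cores take one additional thread.
        placement.extend([core] * (base + (1 if core < extra else 0)))
    return placement


# 8 threads on 4 cores with 4 hardware threads each; entry i is the core of thread i.
print(compact(8, 4, 4))   # [0, 0, 0, 0, 1, 1, 1, 1]
print(scatter(8, 4, 4))   # [0, 1, 2, 3, 0, 1, 2, 3]
print(balanced(8, 4, 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Under these assumptions compact uses only two of the four cores, whereas scatter and balanced both place two threads on every core and differ only in which thread numbers share a core.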

Figure 2. Schematic of thread affinity type distributions
