[latexpage]

**NOTE: This article does not represent the whole truth. Actually, OpenMP performs better here; not all relevant optimizations have been exploited. An updated article will follow as soon as possible!**

Many problems can be reduced to, or reformulated such that, their solution is equivalent to solving a linear system of equations (LSE) of the form:

$Ax = b$ where $A \in \mathbb{R}^{N \times N}$ and $x, b \in \mathbb{R}^{N}$.

There exists a wide variety of algorithms for solving such systems. This post lays out a possible parallel implementation of the Gauss-Seidel method using HPX and OpenMP.

As an example, we use this method to solve a stationary, second-order partial differential equation (PDE):

$-\triangle u + k^2 u = f$

with Dirichlet boundary conditions on a uniform two-dimensional grid. After discretizing this PDE using the central difference method, we obtain the following matrix for our solver:

$A = \begin{bmatrix} a & -b & & -c & & \\ -b & a & -b & & -c & \\ & \ddots & \ddots & \ddots & & \ddots \\ -c & & -b & a & -b & & -c \\ & \ddots & & \ddots & \ddots & \ddots & & \ddots \\ & & -c & & -b & a & -b & & -c \\ & & & \ddots & & \ddots & \ddots & \ddots \\ & & & & -c & & -b & a & -b \\ & & & & & -c & & -b & a \end{bmatrix}$ and $x=\begin{bmatrix} u(0,0) \\ u(0,1) \\ \vdots \\ u(0,N_y) \\ \vdots \\ u(x,y) \\ \vdots \\ u(N_x, N_y) \end{bmatrix} $

where $a = \frac{2}{h_x^2} + \frac{2}{h_y^2} + k^2$, $b = \frac{1}{h_x^2}$, and $c = \frac{1}{h_y^2}$; $h_x$ and $h_y$ are the step sizes of the spatial discretization.
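These coefficients follow from the standard five-point central-difference stencil (a short reconstruction of the derivation, which is not spelled out here). At an interior grid point $(x, y)$ the discretized equation reads

$\frac{2u_{x,y} - u_{x-1,y} - u_{x+1,y}}{h_x^2} + \frac{2u_{x,y} - u_{x,y-1} - u_{x,y+1}}{h_y^2} + k^2 u_{x,y} = f_{x,y}$

which, collecting terms, is exactly $a\,u_{x,y} - b\,(u_{x-1,y} + u_{x+1,y}) - c\,(u_{x,y-1} + u_{x,y+1}) = f_{x,y}$ and motivates the Gauss-Seidel update formula used below.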

Now, our algorithm can be formulated as follows:

```cpp
for (unsigned iter = 0; iter < max_iterations; ++iter) {
    for (unsigned x = 1; x < N_x - 1; ++x) {
        for (unsigned y = 1; y < N_y - 1; ++y) {
            u(x, y) = ((u(x-1, y) + u(x+1, y)) * b
                     + (u(x, y-1) + u(x, y+1)) * c
                     + f(x, y)) / a;
        }
    }
}
```

In order to parallelize this algorithm, we first need to look at the dependencies between the individual updates:

- To update a grid point, its update from the previous iteration has to be finished first.
- The updates of the top and left neighbours in the current iteration need to be finished.
- The updates of the bottom and right neighbours in the previous iteration need to be finished.

An illustration of these dependencies can be seen here:

# Parallelization with OpenMP

In order to parallelize this algorithm with OpenMP, the Red-Black scheme for updating the grid points can be used: the grid is colored like a checkerboard, one sweep updates all points of one color (here, those where $x + y$ is odd), and a second sweep updates the remaining points. Points of the same color are never direct neighbours, so each sweep is free of dependencies:

That way, it is easy to rewrite the algorithm introduced above for use with OpenMP:

```cpp
for (unsigned iter = 0; iter < max_iterations; ++iter) {
    #pragma omp parallel for
    for (unsigned x = 1; x < N_x - 1; ++x) {
        for (unsigned y = (x % 2) + 1; y < N_y - 1; y += 2) {
            u(x, y) = ((u(x-1, y) + u(x+1, y)) * b
                     + (u(x, y-1) + u(x, y+1)) * c
                     + f(x, y)) / a;
        }
    }
    #pragma omp parallel for
    for (unsigned x = 1; x < N_x - 1; ++x) {
        for (unsigned y = ((x + 1) % 2) + 1; y < N_y - 1; y += 2) {
            u(x, y) = ((u(x-1, y) + u(x+1, y)) * b
                     + (u(x, y-1) + u(x, y+1)) * c
                     + f(x, y)) / a;
        }
    }
}
```

Due to the implicit barrier at the end of each parallel loop, all dependencies are met in every iteration: points of the same color do not depend on each other, so each sweep can run fully in parallel, and the barrier guarantees that one color is completely updated before the sweep over the other color begins.

# Parallelization with HPX

Due to the HPX programming model, handling the dependencies is easy! In contrast to OpenMP, we don't need to devise a specific scheme for traversing the grid. HPX provides us with futures and promises, which can be used to track dependencies directly: while traversing the grid, we spawn tasks that update a certain element and explicitly wait until all of their dependencies are met. This way, we are able to remove all barriers from the code, including the ones between iterations. However, spawning a task comes with a certain overhead, which means we have to limit the number of tasks spawned. In the current implementation, that is done by dividing the grid into smaller blocks. The effect on runtime can be seen here:

The next post will explain the techniques used to implement the algorithm in detail.

# Strong Scaling of the two implementations

This graph shows how important removing barriers is for parallel applications. By removing the barriers, the HPX implementation shows strong scaling up to 40 cores, while the OpenMP version reaches its scaling limit rather quickly.

Good strong scaling behavior isn't everything! The following graph shows that the HPX version also achieves a better wall-clock time than the OpenMP version once more than 4 cores are used.
