Performance tuning
MPI
The vast majority of parallel scientific applications use the MPI library as the main way to implement parallelism; it is used so universally that the Cray compiler wrappers on ARCHER2 link to the Cray MPI library by default. Unlike other clusters you may have used, there is no choice of MPI library on ARCHER2: regardless of what compiler you are using, your program will use Cray MPI. This is because the Slingshot network on ARCHER2 is Cray-specific and significant effort has been put in by Cray software engineers to optimise the MPI performance on their Shasta systems.
Here we list a number of suggestions for improving the performance of your MPI programs on ARCHER2. Although MPI programs are capable of scaling very well due to the bespoke communications hardware and software, the details of how a program calls MPI can have significant effects on achieved performance.
Note
Many of these tips are actually quite generic and should be beneficial to any MPI program; however, they all become much more important when running on very large numbers of processes on a machine the size of ARCHER2.
MPI environment variables
There are a number of environment variables available to control aspects of MPI behaviour on ARCHER2. The full set of options can be displayed by running:
man intro_mpi
A couple of specific variables to highlight are MPICH_OFI_STARTUP_CONNECT and MPICH_OFI_RMA_STARTUP_CONNECT.
When using the default OFI transport layer, the connections between ranks are set up as they are required. This allows for good performance while reducing memory requirements. However, for jobs using all-to-all communication it might be better to generate these connections in a coordinated way at the start of the application. To enable this, set the following environment variable:
export MPICH_OFI_STARTUP_CONNECT=1
Additionally, for RMA jobs requiring an all-to-all communication pattern on a node, it may be beneficial to set up the connections between processes on a node in a coordinated fashion:
export MPICH_OFI_RMA_STARTUP_CONNECT=1
This option automatically enables MPICH_OFI_STARTUP_CONNECT.
Synchronous vs asynchronous communications
MPI_Send
A standard way to send data in MPI is using MPI_Send (aptly called standard send). Somewhat confusingly, MPI is allowed to choose how to implement this in two different ways:
- Synchronously: the sending process waits until a matching receive has been posted, i.e. it operates like MPI_Ssend. This clearly has the risk of deadlock if no receive is ever issued.
- Asynchronously: MPI makes a copy of the message into an internal buffer and returns straight away without waiting for a matching receive; the message may actually be delivered later on. This is like the behaviour of the buffered send routine MPI_Bsend.
The rationale is that MPI, rather than the user, should decide how best to send a message.
In practice, what typically happens is that MPI tries to use an asynchronous approach via the eager protocol: the message is sent directly to a preallocated buffer on the receiver and the routine returns immediately afterwards. Clearly there is a limit on how much space can be reserved for this, so:
- small messages will be sent asynchronously;
- large messages will be sent synchronously.
The threshold is often termed the eager limit, and it is fixed for the entire run of your program. Its default setting varies from system to system, but might be around 8KiB.
Implications
- An MPI program will typically run faster if MPI_Send is implemented asynchronously using the eager protocol, since synchronisation between sender and receiver is much reduced.
- However, you should never assume that MPI_Send buffers your message, so if you have concerns about deadlock you will need to use the non-blocking variant MPI_Isend to guarantee that the send routine returns control to you immediately even if there is no matching receive.
- It is not enough to say "deadlock is an issue in principle, but it runs OK on my laptop so there is no problem in practice". The eager limit is system-dependent, so the fact that a message happens to be buffered on your laptop is no guarantee it will be buffered on ARCHER2.
- To check that you have a correct code, replace all instances of MPI_Send/MPI_Isend with MPI_Ssend/MPI_Issend. A correct MPI program should still run correctly when all references to standard send are replaced by synchronous send (since MPI is allowed to implement standard send as synchronous send). A sketch of this test is shown after this list.
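To make the last two points concrete, here is a minimal sketch of a ring exchange that relies on MPI_Send being buffered, together with a safe non-blocking version; the message length and tag are illustrative. Replacing MPI_Send with MPI_Ssend in the unsafe version makes the latent deadlock appear immediately.

#include <mpi.h>

/* Shift data one place around a ring of processes */
void ring_shift(double *sendbuf, double *recvbuf, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    /* Unsafe: relies on MPI_Send buffering the message (eager protocol).
       With MPI_Ssend every rank blocks in the send, no rank reaches its
       receive, and the program deadlocks.

       MPI_Send(sendbuf, n, MPI_DOUBLE, next, 0, comm);
       MPI_Recv(recvbuf, n, MPI_DOUBLE, prev, 0, comm, MPI_STATUS_IGNORE);
    */

    /* Safe: the non-blocking send returns immediately whatever the eager
       limit, so the receive is always posted */
    MPI_Request req;
    MPI_Isend(sendbuf, n, MPI_DOUBLE, next, 0, comm, &req);
    MPI_Recv(recvbuf, n, MPI_DOUBLE, prev, 0, comm, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}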
Tuning performance
With most MPI libraries you should be able to alter the default value of the eager limit at runtime, perhaps via an environment variable or a command-line argument to mpirun.
The advice for tuning the performance of MPI_Send is:
- find out what the distribution of message sizes for MPI_Send is (a profiling tool may be useful here; a do-it-yourself sketch is shown after this list);
- note that this applies to MPI_Isend as well: even in the non-blocking form, which can help to weaken synchronisation between sender and receiver, the amount of hand-shaking required is much reduced if the eager protocol is used;
- find out from the system documentation how to alter the value of the eager limit (there is no standardised way to set it);
- set the eager limit to a value larger than your typical message size -- you may need to add a small amount, say a few hundred bytes, to allow for any additional header information that is added to each message;
- measure the performance before and after to check that it has improved.
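If a full profiling tool is not to hand, the standard PMPI profiling interface can be used to gather the message-size distribution mentioned above. The following is only a minimal sketch: the 16 KiB threshold is an illustrative guess at the eager limit, and a real tool would record a proper histogram.

#include <mpi.h>
#include <stdio.h>

#define EAGER_GUESS 16384          /* illustrative threshold in bytes */

static long nsmall = 0, nlarge = 0;

/* Intercept MPI_Send: record the message size, then call the real routine */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(datatype, &size);
    if ((long) count * size <= EAGER_GUESS) nsmall++; else nlarge++;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

/* Report the counts from each rank just before MPI shuts down */
int MPI_Finalize(void)
{
    printf("MPI_Send calls: %ld of size <= %d bytes, %ld larger\n",
           nsmall, EAGER_GUESS, nlarge);
    return PMPI_Finalize();
}

Compiling this file into the application is enough for the wrappers to take effect, since the MPI standard requires the library to export the PMPI_ entry points for exactly this purpose.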
Note
It cannot be stressed strongly enough that although the performance may be affected by the value of the eager limit, the functionality of your program should be unaffected. If changing the eager limit affects the correctness of your program (e.g. whether or not it deadlocks) then you have an incorrect MPI program.
Setting the eager limit on ARCHER2
On ARCHER2, things are a little more complicated. Although the eager limit defaults to 16KiB, messages up to 256KiB are sent asynchronously because they are actually sent as a number of smaller messages.
To send even larger messages asynchronously, alter the value of FI_OFI_RXM_SAR_LIMIT in your job submission script, e.g. to set it to 512KiB:
export FI_OFI_RXM_SAR_LIMIT=524288
You can also control the size of the smaller messages by altering the value of FI_OFI_RXM_BUFFER_SIZE in your job submission script, e.g. to set it to 128KiB:
export FI_OFI_RXM_BUFFER_SIZE=131072
A different protocol is used for messages between two processes on the same node. The default eager limit for these is 8KiB. Although the performance of on-node messages is unlikely to be a limiting factor for your program, you can change this value, e.g. to set it to 16KiB:
export MPICH_SMP_SINGLE_COPY_SIZE=16384
Collective operations
Many of the collective operations that are commonly required by parallel scientific programs, i.e. operations that involve a group of processes, are already implemented in MPI. The canonical operation is perhaps adding up a double precision number across all MPI processes, which is best achieved by a reduction operation:
MPI_Allreduce(&x, &xsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
This will be implemented using an efficient algorithm, for example one based on a binary tree. Using such divide-and-conquer approaches typically results in an algorithm whose execution time on P processes scales as log_2(P); compare this to a naive approach where every process sends its input to rank 0, where the time will scale as P. This might not be significant on your laptop, but even on as few as 1000 processes the tree-based algorithm will already be around 100 times faster, since log_2(1000) is roughly 10 compared with around 1000 messages arriving sequentially at rank 0.
So, the basic advice is always use a collective routine to implement your communications pattern if at all possible.
In real MPI applications, collective operations are often called on a small amount of data, for example a global reduction of a single variable. In these cases, the time taken will be dominated by message latency and the first port of call when looking at performance optimisation is to call them as infrequently as possible!
- If you are simply printing diagnostics to the screen in an iterative loop, consider doing this less frequently, e.g. every ten iterations, or even not at all (although you should make it easy to turn diagnostics back on for future debugging).
- If you are computing some termination criterion, it may actually be faster overall to compute it and check for convergence infrequently, e.g. every ten iterations, even though this means that your program could run for up to 9 extra iterations.
- If possible, group data into a single buffer and call a single reduction with count > 1; two reductions with count = 1 will take almost exactly twice as long as a single reduction with count = 2 (see the sketch after this list).
- For example, if you only need to output a sequence of summed data at the end of the run, store the partial totals in an array and do a single reduction right at the end.
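For example, a minimal sketch of combining two global sums into a single call; the quantities being summed are purely illustrative.

#include <mpi.h>

void global_totals(double local_energy, double local_mass,
                   double *total_energy, double *total_mass)
{
    double local[2], global[2];

    local[0] = local_energy;
    local[1] = local_mass;

    /* One reduction with count = 2: pays the latency cost once, not twice */
    MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    *total_energy = global[0];
    *total_mass   = global[1];
}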
Sometimes, the collective routines available may not appear to do exactly what you want. However, they can sometimes be used with a small amount of additional programming work:
- To operate on a subset of processes, create sub-communicators containing the relevant subset(s) and use these communicators instead of MPI_COMM_WORLD. Useful functions for communicator management include: MPI_Comm_split, which is the most general routine; MPI_Comm_split_type, which can be used to create a separate communicator for each shared-memory node with split type = MPI_COMM_TYPE_SHARED; and MPI_Cart_sub, which can divide a Cartesian communicator into regular slices.
- If the communication pattern is what you want, but the data on each process is not arranged in the required layout, consider using MPI derived datatypes for the input and/or output. This can be useful, for example, if you want to communicate non-contiguous data such as a subsection of a multidimensional array, although care must be taken in defining these types to ensure they have the correct extents. Another example would be using MPI_Allreduce to add up an integer and a double-precision variable in a single call by putting them together into a C struct and defining a matching MPI datatype using MPI_Type_create_struct. Here you would also have to provide MPI with a custom reduction operation using MPI_Op_create; a sketch of this is shown after this list.
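To make the second example concrete, here is a minimal sketch of reducing an int and a double in one MPI_Allreduce call; the struct layout and all names are illustrative.

#include <mpi.h>
#include <stddef.h>

typedef struct { int n; double x; } pair_t;

/* Element-wise sum of both fields, with the signature required by MPI_Op_create */
static void pair_sum(void *in, void *inout, int *len, MPI_Datatype *type)
{
    pair_t *a = (pair_t *) in;
    pair_t *b = (pair_t *) inout;
    for (int i = 0; i < *len; i++) {
        b[i].n += a[i].n;
        b[i].x += a[i].x;
    }
}

void reduce_pair(pair_t *local, pair_t *global, MPI_Comm comm)
{
    /* Describe the struct layout to MPI */
    int          blocklens[2] = { 1, 1 };
    MPI_Aint     displs[2]    = { offsetof(pair_t, n), offsetof(pair_t, x) };
    MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };
    MPI_Datatype tmptype, pairtype;

    MPI_Type_create_struct(2, blocklens, displs, types, &tmptype);
    /* Resize so the extent matches sizeof(pair_t), allowing for any padding */
    MPI_Type_create_resized(tmptype, 0, sizeof(pair_t), &pairtype);
    MPI_Type_commit(&pairtype);

    /* Register the custom (commutative) reduction and make a single call */
    MPI_Op pairsum;
    MPI_Op_create(pair_sum, 1, &pairsum);

    MPI_Allreduce(local, global, 1, pairtype, pairsum, comm);

    MPI_Op_free(&pairsum);
    MPI_Type_free(&pairtype);
    MPI_Type_free(&tmptype);
}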
Many MPI programs call MPI_Barrier to explicitly synchronise all the processes. Although this can be useful for getting reliable performance timings, it is rare in practice to find a program where the call is actually needed for correctness. For example, you may see:
// Ensure the input x is available on all processes
MPI_Barrier(MPI_COMM_WORLD);
// Perform a global reduction operation
MPI_Allreduce(&x, &xsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
// Ensure the result xsum is available on all processes
MPI_Barrier(MPI_COMM_WORLD);
Neither of these barriers is needed, as the reduction operation performs all the required synchronisation.
If removing a barrier from your MPI code makes it run incorrectly, then this should ring alarm bells -- it is often a symptom of an underlying bug that is simply being masked by the barrier.
For example, if you use non-blocking calls such as MPI_Irecv then it is the programmer's responsibility to ensure that these are completed at some later point, for example by calling MPI_Wait on the returned request object. A common bug is to forget to do this, in which case you might be reading the contents of the receive buffer before the incoming message has arrived (e.g. if the sender is running late).
Calling a barrier may mask this bug as it will make all the processes wait for each other, perhaps allowing the late sender to catch up. However, this is not guaranteed so the real solution is to call the non-blocking communications correctly.
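As a minimal sketch of the correct pattern (the helper routines are hypothetical stand-ins for application code), complete the non-blocking receive before reading the buffer rather than relying on a barrier to hide the race:

#include <mpi.h>

void do_other_work(void);                 /* hypothetical work that does not touch buf */
void use_data(const double *buf, int n);  /* hypothetical consumer of the received data */

void receive_and_use(int source, int tag, MPI_Comm comm, double *buf, int n)
{
    MPI_Request req;

    MPI_Irecv(buf, n, MPI_DOUBLE, source, tag, comm, &req);

    do_other_work();                      /* safe to overlap as long as buf is not read */

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* buf is only guaranteed valid after this */
    use_data(buf, n);
}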
One of the few times when a barrier may be required is if processes are communicating with each other via some other non-MPI method, e.g. via the file system. If you want processes to sequentially open, append to, then close the same file then barriers are a simple way to achieve this:
for (i=0; i < size; i++)
{
  if (rank == i) append_data_to_file(data, filename);
  MPI_Barrier(comm);
}
but this is really something of a special case.
Global synchronisation may be required if you are using more advanced techniques such as hybrid MPI/OpenMP or single-sided MPI communication with put and get, but typically you should be using specialised routines such as MPI_Win_fence rather than MPI_Barrier.
Tip
If you run a performance profiler on your code and it shows a lot of time being spent in a collective operation such as MPI_Allreduce, this is not necessarily a sign that the reduction operation itself is the bottleneck. This is often a symptom of load imbalance: even if a reduction operation is efficiently implemented, it may take a long time to complete if the MPI processes do not all call it at the same time. MPI_Allreduce synchronises across processes so will have to wait for all the processes to call it before it can complete. A single slow process will therefore adversely impact the performance of your entire parallel program.
OpenMP
There are a variety of possible issues that can result in poor performance of OpenMP programs. These include:
Sequential code
Code outside of parallel regions is executed sequentially by the master thread.
Idle threads
If different threads have different amounts of computation to do, then threads may be idle whenever a barrier is encountered, for example at the end of parallel regions or the end of worksharing loops. For worksharing loops, choosing a suitable schedule kind may help. For more irregular computation patterns, using OpenMP tasks might offer a solution: the runtime will try to load balance tasks across the threads in the team.
Synchronisation mechanisms that enforce mutual exclusion, such as critical regions, atomic statements and locks, can also result in idle threads if there is contention, as threads have to wait their turn for access.
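As an illustration of the task-based approach mentioned above, here is a minimal sketch of processing a linked list where the cost per item is irregular; the node type and the work() routine are illustrative assumptions.

typedef struct node { struct node *next; /* ... payload ... */ } node;

void work(node *p);   /* hypothetical per-item computation of varying cost */

void process_list(node *head)
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (node *p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)
                work(p);          /* idle threads pick up tasks as they become available */
            }
        }   /* implicit barrier: all tasks are guaranteed to have completed here */
    }
}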
Synchronisation
The act of synchronising threads comes at some cost, even if the threads are never idle. In OpenMP, the most common source of synchronisation overheads is the implicit barriers at the end of parallel regions and worksharing loops. The overhead of these barriers depends on the OpenMP implementation being used as well as on the number of threads, but is typically in the range of a few microseconds. This means that for a simple parallel loop such as
#pragma omp parallel for reduction(+:sum)
for (i=0; i < n; i++) {
  sum += a[i];
}
the number of iterations required to make parallel execution worthwhile may be of the order of 100,000. On ARCHER2, benchmarking has shown that for the AOCC compiler, OpenMP barriers have significantly higher overhead than for either the Cray or GNU compilers.
It is possible to suppress the implicit barrier at the end of a worksharing loop using a nowait clause, taking care that this does not introduce any race conditions.
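For example, a minimal sketch of two independent worksharing loops where the nowait clause removes the barrier between them; the arrays are assumed not to overlap.

void scale_two_arrays(int n, double s,
                      double *a, const double *b,
                      double *c, const double *d)
{
    #pragma omp parallel
    {
        /* The second loop does not read anything written by the first,
           so the implicit barrier between them can safely be dropped */
        #pragma omp for nowait
        for (int i = 0; i < n; i++) a[i] = s * b[i];

        #pragma omp for
        for (int i = 0; i < n; i++) c[i] = s * d[i];
    }
}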
Atomic statements are designed to be capable of a more efficient implementation than the equivalent critical region or lock/unlock pair, so should be used where applicable.
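A minimal sketch of the kind of single memory update where atomic is applicable (the counting condition is illustrative):

long count_above(int n, const double *a, double threshold)
{
    long count = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (a[i] > threshold) {
            /* A single shared-variable update: atomic is cheaper than a critical region */
            #pragma omp atomic
            count++;
        }
    }
    return count;
}

For this particular counting pattern a reduction(+:count) clause would be better still; the point here is only the relative cost of atomic versus a critical region.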
Scheduling
Whenever we rely on the OpenMP runtime to dynamically assign computation to threads (e.g. dynamic or guided loop schedules, tasks), there is some overhead incurred (some of this cost may actually be internal synchronisation in the runtime). It is often necessary to adjust the granularity of the computation to find a compromise between too many small units (and high scheduling cost) and too few large units (where load imbalance may dominate). For example, we can choose a non-default chunksize for the dynamic schedule, or adjust the amount of computation within each OpenMP task construct.
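For example, a minimal sketch of choosing a chunksize for a dynamic schedule; the value of 16 and the irregular_work() routine are illustrative, and the best chunksize should be found by experiment.

void irregular_work(int i);   /* hypothetical routine whose cost varies with i */

void process(int n)
{
    /* Chunks of 16 iterations are handed out on demand: large enough to keep
       scheduling overhead down, small enough to allow load balancing */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) {
        irregular_work(i);
    }
}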
Communication
Communication between threads in OpenMP takes place via the cache coherency mechanism. In brief, whenever a thread writes a memory location, all copies of this location which are in a cache belonging to a different core have to be marked as invalid. Subsequent accesses to this location by other threads will result in the up-to-date value being retrieved from the cache where the last write occurred (or possibly from main memory).
Due to the fine granularity of memory accesses, these overheads are difficult to analyse or monitor. To minimise communication, we need to write code with good data affinity - i.e. each thread should access the same subset of program data as much as possible.
NUMA effects
On modern CPU nodes, main memory is often organised in NUMA regions - sections of main memory associated with a subset of the cores on a node. On ARCHER2 nodes, there are 8 NUMA regions per node, each associated with 16 CPU cores. On such systems the location of data in main memory with respect to the cores that are accessing it can be important. The default OS policy is to place data in the NUMA region which first accesses it (the first touch policy). For OpenMP programs this can be the worst possible option: if the data is initialised by the master thread, it is all allocated in one NUMA region, and having every thread access that one region becomes a bandwidth bottleneck.
This default policy can be changed using the numactl command, but it is probably better to make use of the first touch policy by explicitly parallelising the data initialisation in the application code. This may be straightforward for large multidimensional arrays, but more challenging for irregular data structures.
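A minimal sketch of this approach for a simple 1D array; the key point is that the initialisation loop is parallelised with the same (static) schedule as the loops that later compute on the data.

#include <stdlib.h>

double *allocate_and_init(long n)
{
    double *a = malloc(n * sizeof(double));

    /* Each thread first-touches the pages it will later work on, so the data
       is spread across the NUMA regions rather than placed all in one */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        a[i] = 0.0;
    }

    return a;
}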
False sharing
The cache coherency mechanism described above operates on units of data corresponding to the size of a cache line, which for ARCHER2 CPUs is 64 bytes. This means that if different threads are accessing neighbouring words in memory, and at least some of the accesses are writes, then communication may be happening even if no individual word is actually accessed by more than one thread. As a result, patterns such as
#pragma omp parallel shared(count) private(myid)
{
  myid = omp_get_thread_num();
  ....
  count[myid]++;
  ....
}
may give poor performance if the updates to the count array are sufficiently frequent.
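One common way to avoid this, sketched below assuming 64-byte cache lines and an illustrative upper bound on the number of threads, is to pad each thread's entry to a full cache line so that no two threads ever write to the same line:

#include <omp.h>
#include <stdalign.h>

#define MAX_THREADS 128   /* assumed upper bound on the team size */

typedef struct {
    long value;
    char pad[64 - sizeof(long)];   /* pad each entry to one 64-byte cache line */
} padded_count_t;

alignas(64) padded_count_t count[MAX_THREADS];

void count_events(void)
{
    #pragma omp parallel
    {
        int myid = omp_get_thread_num();
        /* ... */
        count[myid].value++;   /* each thread now updates its own cache line */
        /* ... */
    }
}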
Hardware resource contention
Whenever there are multiple threads (or processes) executing inside a node, they may contend for some hardware resources. The most important of these for many HPC applications is memory bandwidth. This effect is very evident on ARCHER2 CPUs: it is possible for just 2 threads to almost saturate the available memory bandwidth in a NUMA region which has 16 cores associated with it. For very bandwidth-intensive applications, running more than 2 threads per NUMA region may gain little additional performance. If an OpenMP code is not using all the cores on a node, by default Slurm will spread the threads out across NUMA regions to maximise the available bandwidth.
Another resource that threads may contend for is space in shared caches. On ARCHER2, every set of 4 cores shares 16MB of L3 cache.
Compiler non-optimisation
In rare cases, adding OpenMP directives can adversely affect the compiler's optimisation process. The symptom of this is that the OpenMP code running on 1 thread is slower than the same code compiled without the OpenMP flag. It can be difficult to find a workaround - using the compiler's diagnostic flags to find out which optimisation (e.g. vectorisation, loop unrolling) is being affected and adding compiler-specific directives may help.
Hybrid MPI and OpenMP
There are two main motivations for using both MPI and OpenMP in the same application code: reducing memory requirements and improving performance. At low core counts, where the pure MPI version of the code is still scaling well, adding OpenMP is unlikely to improve performance. In fact, it can introduce some additional overheads which make performance worse! The benefit is likely to come in the regime where the pure MPI version starts to lose scalability - here adding OpenMP can reduce communication costs, make load balancing easier, or be an effective way of exploiting additional parallelism without excessive code re-writing.
An important performance consideration for MPI + OpenMP applications is the choice of the number of OpenMP threads per MPI process. The optimum value will depend on the application, the input data, the number of nodes requested and the choice of compiler, and is hard to predict without experimentation. However, there are some considerations that apply to ARCHER2:
- Due to NUMA effects, it is likely that running at least one MPI process per NUMA region (i.e. at least 8 MPI processes per node) will be beneficial.
- The number of MPI processes per node should be a power of 2, so that all OpenMP threads run in the same NUMA region as their parent MPI process.
- For applications where each process has a small memory footprint (e.g. some molecular dynamics codes), running no more than 4 OpenMP threads per MPI process may be beneficial, so that all the threads in a process share a single L3 cache.