LIKWID
LIKWID is an open-source tool suite that can be used to measure node-level hardware performance counters, amongst other functionalities. It offers ways to quantify performance, e.g. when investigating performance bottlenecks or degradation. In this documentation we present guidance for common use cases on ARCHER2, focusing on performance counter measurement for parallel applications. For more information on LIKWID functionality and usage see the official LIKWID wiki.
likwid-perfctr and likwid-mpirun
LIKWID provides a number of command line tools. Performance counter measurement is handled by likwid-perfctr, which supports the following usage modes:
- wrapper mode (default): use likwid-perfctr as a wrapper to launch your application and measure performance counters while it executes. This is the easiest way to ensure that measurement starts and stops when your application starts and stops, and that the cores whose counters are included in measurements are those on which your application executes.
- wrapper mode + marker API: when using wrapper mode it is possible to measure performance counters only during execution of one or more specific regions in your code, for example to quantify the performance of known computationally costly kernels. This requires instrumenting your code to use the LIKWID marker API and recompiling.
- stethoscope mode: launch your application as usual, then instruct likwid-perfctr to measure performance counters associated with the cores on which your application is executing. Measurements are aggregated over a duration of time that you specify. It may be difficult to relate results to what application code was executed over the measurement duration, so this mode is better suited to obtaining a snapshot of performance than to performing a systematic performance assessment.
- timeline mode: launch your application as usual, then instruct likwid-perfctr to periodically output performance counter measurements aggregated over the time interval specified. As for stethoscope mode, you must tell likwid-perfctr which cores to measure counters for, ensuring these match where your application is executing. This mode can provide insight into performance during different phases of your application, though it may be difficult to relate results to what application code was executed over any given measurement interval.
Using likwid-perfctr in any of the above modes other than wrapper + marker API can be done without altering or recompiling your application code.
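As a brief illustration (a sketch only; the core list, group and duration are placeholders and should match your own run), a stethoscope measurement of whatever is currently executing on cores 0-7 could be taken with:
likwid-perfctr -c 0-7 -g FLOPS_DP -S 10s
This aggregates the FLOPS_DP counters on those cores over a 10 second window; timeline mode is selected analogously with -t and a measurement interval (e.g. -t 2s). Here -c specifies which cores to measure without pinning anything to them.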
likwid-perfctr is designed to work with serial or thread-parallel applications. Measuring counters for MPI-parallel applications may in principle be accomplished by combining likwid-perfctr with a parallel application launcher such as srun, as described in this tutorial.
LIKWID provides a more elegant solution in the form of a wrapper called likwid-mpirun. This launches likwid-perfctr and your MPI-parallel application and aggregates measurement outputs across ranks.
Note
likwid-mpirun only supports likwid-perfctr's wrapper mode (with or without the marker API).
LIKWID on ARCHER2
For the sake of simplicity and convenience we provide guidance on how to use likwid-mpirun to measure performance counters on ARCHER2 regardless of whether the application is serial, thread-parallel, MPI-parallel, or hybrid MPI + thread-parallel. This unified approach has the added benefit of being closer to the ARCHER2 default srun-based application launch approach used in existing job scripts than using likwid-perfctr directly.
Using likwid-mpirun restricts measurement functionality to likwid-perfctr's wrapper mode, which supports performance characterisation through either whole-application or kernel-specific measurement. Users interested in running likwid-perfctr directly on ARCHER2, for example to access timeline or stethoscope mode, may consult the example job script for pure threaded applications that uses likwid-perfctr below, as well as the wiki page on LIKWID's pinning syntax, and can contact the ARCHER2 helpdesk to request assistance.
LIKWID is available on ARCHER2 as a centrally installed module (likwid/5.4.1-archer2), which provides a customised version of the official 5.4.1 release. The primary difference compared to the official release is that likwid-mpirun has been adapted for improved compatibility with the ARCHER2 Slurm configuration, default job launch recommendations and the Cray MPI library. It offers the --nocpubind option, which we introduced and recommend using so that binding of application processes to CPUs is left for srun to control, as per the standard approach for running jobs on ARCHER2. The central LIKWID install also incorporates several other minor changes. The guidance and example job scripts presented below pertain to this custom version and its usage on ARCHER2.
Note
LIKWID on ARCHER2 uses the perf_event backend with perf_event_paranoid set to -1 (no restrictions), which has some implications for the features and functionality available.
Summary of likwid-mpirun options
The following options are important to be aware of when using likwid-mpirun on ARCHER2. For additional information, try likwid-mpirun --help and see the LIKWID wiki, especially the likwid-mpirun page.
-n/-np <count>
Specify the total number of processes to launch
-t <count>
The number of threads or threads per MPI rank. Defaults to 1 if not specified. Can be used to "space" processes for placement in the case of hybrid MPI + threaded applications and act as an alternative to -pin below in these cases.
-pin <list>
Specify pinning of processes (and their threads if applicable). Follows the likwid-pin syntax.
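For example (illustrative values, not a recommendation for any particular job), -pin N:0-15_N:16-31 would pin two ranks per node, the first to cores 0-15 and the second to cores 16-31, with the per-rank pinning expressions separated by underscores.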
-g/--group <perf>
Specify which predefined group of performance counters and derived metrics to measure and compute. Details about these groups and available counters for the Zen2 architecture of ARCHER2's AMD EPYC processors can be found at https://github.com/RRZE-HPC/likwid/wiki/Zen2.
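To see which performance groups are available on a compute node, you can run likwid-perfctr -a there (for example in an interactive session or a short job) with the likwid module loaded.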
--nocpubind (ARCHER2 only)
Suppress likwid-mpirun's binding of application processes to CPUs (cores), which would otherwise take place through generation of a CPU mask list passed to srun. We recommend always using --nocpubind on ARCHER2 to avoid conflicts with binding/affinity specified following the usual approach on ARCHER2; this is shown in the example job scripts.
-s/--skip <hex>
This tells likwid-mpirun how many threads to skip when generating its specification of which cores to pin to/measure (see this discussion of LIKWID and shepherd threads for a detailed explanation). On ARCHER2 we have checked and confirmed that there are no shepherd threads involved using any of PrgEnv-gnu, PrgEnv-cray or PrgEnv-aocc, therefore not skipping any threads (-s 0x0) is the correct choice (this differs from the likwid-mpirun defaults), as reflected in the example job scripts below.
-d/--debug
To check exactly how likwid-mpirun calls srun and likwid-perfctr to launch and measure your application, use the --debug option, which generates additional output that includes the relevant commands. For the modified likwid-mpirun on ARCHER2, the otherwise temporary files .likwidscript_*.txt referenced in the debug output, which contain these commands, persist after execution, enabling closer inspection.
--mpiopts
Any desired options accepted by srun can be passed to the srun command through the --mpiopts option, for example as follows:
likwid-mpirun --mpiopts "--exact --hint=nomultithread --distribution=block:block"
-m/--marker
Activate marker API mode (see the section on the LIKWID Marker API below).
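For instance (a sketch that mirrors the example job scripts below; the process count and performance group are purely illustrative), an application instrumented with the marker API could be measured with:
likwid-mpirun -n 256 --nocpubind -s 0x0 -g FLOPS_DP -m myApplication &> application.out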
Example job scripts
Below we provide example job scripts covering different cases of application parallelism using likwid-perfctr in wrapper mode, either through likwid-mpirun or directly. All examples perform whole-application measurement but can be adapted to use the marker API.
Each of the example job scripts that uses likwid-mpirun makes use of the srun/sbatch options --hint=nomultithread and --distribution=block:block. We have set these as SBATCH options at the top of the job script. Alternatively, these as well as any other srun options could be passed explicitly to the underlying srun command through the --mpiopts option as follows:
likwid-mpirun --mpiopts "--hint=nomultithread --distribution=block:block"
Each example job script includes a suggested command to run xthi, launched identically to the application you wish to measure, in order to check and confirm that process and thread placement for your application is as intended. Details on checking process placement with xthi can be found in the User Guide page on Running jobs.
Note
You are encouraged to check how likwid-mpirun uses srun to launch likwid-perfctr and your application by running in debug mode (likwid-mpirun --debug) and examining both the job output and the .likwidscript_*.txt file mentioned therein.
For pure threaded and MPI + thread parallel jobs, using either the -t option to specify the number of threads or -pin with an appropriate pinning expression (or both) can accomplish the same desired application placement and measurement scenario. The same applies to pure MPI applications when underpopulating nodes (i.e. fewer than 128 ranks per node), where -t can be used to space processes out across a node and/or -pin used to specify which cores on each node application processes should execute and be measured on.
For further explanation of the likwid-mpirun options used, see the section Summary of likwid-mpirun options.
Pure MPI jobs
Fully populated node(s)
Two fully populated nodes (128 ranks per node):
#!/bin/bash
#SBATCH --account=[your project]
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --time=00:20:00
#SBATCH --nodes=2
#SBATCH --tasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --hint=nomultithread
#SBATCH --distribution=block:block
module load likwid
module load xthi
export OMP_NUM_THREADS=1
export SRUN_CPUS_PER_TASK=1
likwid-mpirun -n $SLURM_NTASKS --nocpubind -s 0x0 -g FLOPS_DP --debug xthi_mpi &> xthi_mpi.out
likwid-mpirun -n $SLURM_NTASKS --nocpubind -s 0x0 -g FLOPS_DP myApplication &> application.out
Underpopulated node(s)
Two nodes, two ranks per node, one rank per socket (i.e. per 64-core AMD EPYC processor):
#!/bin/bash
#SBATCH --account=[your project]
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --time=00:20:00
#SBATCH --nodes=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=64
#SBATCH --hint=nomultithread
#SBATCH --distribution=block:block
module load likwid
module load xthi
export OMP_NUM_THREADS=1
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
likwid-mpirun -n $SLURM_NTASKS -pin N:0_N:64 --nocpubind -s 0x0 -g FLOPS_DP --debug xthi_mpi &> xthi_mpi.out
likwid-mpirun -n $SLURM_NTASKS -pin N:0_N:64 --nocpubind -s 0x0 -g FLOPS_DP myApplication &> application.out
The same application placement and measurement scenario can be accomplished by specifying the first core on each socket directly with -pin S0:0_S1:0 instead of -pin N:0_N:64.
One node, four ranks, one rank per NUMA region all on the same socket:
#!/bin/bash
#SBATCH --account=[your project]
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --hint=nomultithread
#SBATCH --distribution=block:block
module load likwid
module load xthi
export OMP_NUM_THREADS=1
export SRUN_CPUS_PER_TASK=16
likwid-mpirun -n $SLURM_NTASKS -pin N:0_N:16_N:32_N:48 --nocpubind -s 0x0 -g FLOPS_DP --debug xthi_mpi &> xthi_mpi.out
likwid-mpirun -n $SLURM_NTASKS -pin N:0_N:16_N:32_N:48 --nocpubind -s 0x0 -g FLOPS_DP myApplication &> application.out
The same application placement and measurement scenario can be accomplished by specifying the first core on each NUMA region directly with -pin M0:0_M1:0_M2:0_M3:0 instead of -pin N:0_N:16_N:32_N:48.
Pure threaded jobs
Fully populated node
#!/bin/bash
#SBATCH --account=[your project]
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --hint=nomultithread
#SBATCH --distribution=block:block
module load likwid
module load xthi
export OMP_NUM_THREADS=128
export OMP_PLACES=cores
export SRUN_CPUS_PER_TASK=128
likwid-mpirun -n 1 -t 128 --nocpubind -s 0x0 -g FLOPS_DP --debug xthi &> xthi.out
likwid-mpirun -n 1 -t 128 --nocpubind -s 0x0 -g FLOPS_DP myApplication &> application.out
The same application placement and measurement scenario can be accomplished using the pinning option -pin N:0-127 instead of -t 128.
For pure threaded applications the likwid-perfctr command can also be used directly instead of likwid-mpirun, bypassing srun. This is shown below in the job script equivalent to the fully populated likwid-mpirun example above.
#!/bin/bash
#SBATCH --account=[your project]
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --time=00:20:00
#SBATCH --nodes=1
module load likwid
module load xthi
export OMP_NUM_THREADS=128
export OMP_PLACES=cores
likwid-perfctr -C N:0-127 -s 0x0 -g FLOPS_DP --debug xthi &> xthi.out
likwid-perfctr -C N:0-127 -s 0x0 -g FLOPS_DP myApplication &> application.out
The -C option simultaneously sets pinning of application threads to cores and specifies those same cores to measure counters for. The following pinning expressions are equivalent:
-C N:0-127
-C E:N:128:1:2
The second form uses LIKWID's expression-based pinning syntax (see the likwid-pin wiki page). This can be understood by examining CPU numbering using the likwid-topology command, which shows that adjacent hardware threads on the same physical core are numbered n and n+128 respectively; hence CPUs (logical cores) numbered 0 through 127 correspond to single hardware threads on each of the 128 distinct physical cores of an ARCHER2 compute node. The final :2 in the expression-based syntax skips the second hardware thread of each physical core when pinning, thereby accomplishing the same as the domain-based pinning expression that specifies direct core number assignment for the N domain (all cores).
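To inspect the topology and CPU numbering of a compute node yourself, you can run likwid-topology on it, for example:
likwid-topology -g
where the -g option adds an ASCII representation of the socket, core and cache layout to the standard listing.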
Underpopulated node
Launching fewer than 128 threads placed consecutively is a straightforward variation on the fully occupied node case above. We can achieve more varied placements, for example four threads in total, each bound to the first core of a different core complex (CCX) on the same socket. The four cores in a CCX share a common L3 cache, so this scenario results in none of the threads sharing the same L3 cache.
This is easiest to accomplish using likwid-perfctr directly rather than through likwid-mpirun, as follows:
#!/bin/bash
#SBATCH --account=[your project]
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --time=00:20:00
#SBATCH --nodes=1
module load likwid
module load xthi
export OMP_NUM_THREADS=4
export OMP_PLACES=cores
likwid-perfctr -C 0,4,8,12 -s 0x0 -g FLOPS_DP xthi &> xthi_perfctr_list.out
The same application placement and measurement scenario can be accomplished using the pinning option -C C0:0@C1:0@C2:0@C3:0 instead. This specifies that threads be assigned to the first core of successive last-level cache (L3) domains. The expression-syntax version -C E:N:4:1:8 would achieve the same.
Hybrid MPI+threaded jobs
Two fully populated nodes, with 2 ranks per node and 64 threads per rank:
#!/bin/bash
#SBATCH --account=[your project]
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --time=00:20:00
#SBATCH --nodes=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=64
#SBATCH --hint=nomultithread
#SBATCH --distribution=block:block
module load likwid
module load xthi
export OMP_NUM_THREADS=64
export OMP_PLACES=cores
export SRUN_CPUS_PER_TASK=64
likwid-mpirun -n $SLURM_NTASKS -t 64 --nocpubind -s 0x0 -g FLOPS_DP --debug xthi &> xthi.out
likwid-mpirun -n $SLURM_NTASKS -t 64 --nocpubind -s 0x0 -g FLOPS_DP myApplication &> application.out
The same application placement and measurement scenario can be accomplished using the pinning option -pin N:0-63_N:64-127 instead of -t 64.
Serial job
Using likwid-mpirun:
#!/bin/bash
#SBATCH --account=[your project]
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
module load likwid
module load xthi
export OMP_NUM_THREADS=1
export SRUN_CPUS_PER_TASK=1
likwid-mpirun -n 1 --nocpubind -s 0x0 -g FLOPS_DP --debug xthi &> xthi.out
likwid-mpirun -n 1 --nocpubind -s 0x0 -g FLOPS_DP myApplication &> application.out
Alternatively, using likwid-perfctr:
#!/bin/bash
#SBATCH --account=[your project]
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --time=00:20:00
#SBATCH --nodes=1
module load likwid
module load xthi
export OMP_NUM_THREADS=1
likwid-perfctr -C 0 -s 0x0 -g FLOPS_DP --debug xthi &> xthi.out
likwid-perfctr -C 0 -s 0x0 -g FLOPS_DP myApplication &> application.out
LIKWID Marker API: instrumenting an application for fine-grained measurement
Another important feature of LIKWID is the ability to measure performance counters for specified regions of your code, such as a computationally intensive kernel. You can instrument your code using the LIKWID Marker API to instruct LIKWID when to start and stop taking measurements. This requires code changes and recompilation to include the likwid-marker.h header and to call the required macros or functions, as illustrated below. Moreover, the code must be compiled with -DLIKWID_PERFMON to turn the Marker API on and with reference to the location where the header files are found (-I $LIKWID_DIR/include). As an LD_PRELOAD mechanism is used, the application must be linked dynamically against the LIKWID library (-L $LIKWID_DIR/lib -llikwid). Markers are recognised by likwid-perfctr and likwid-mpirun when the -m option is enabled. The location of the headers and libraries can be viewed via module show likwid; the module also sets the $LIKWID_DIR environment variable on loading. It may also be necessary to adjust LD_LIBRARY_PATH to include $LIKWID_DIR/lib so that the library is picked up at runtime.
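As a minimal sketch of the build step (assuming a C source file my_code.c, the likwid module loaded and the Cray compiler wrapper cc; the file and executable names are illustrative):
cc -DLIKWID_PERFMON -I$LIKWID_DIR/include my_code.c -L$LIKWID_DIR/lib -llikwid -o myApplication
export LD_LIBRARY_PATH=$LIKWID_DIR/lib:$LD_LIBRARY_PATH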
The example below demonstrates the use of the C API (a similar API is also supported for Fortran and some other languages). The initialisation and closing macros or functions must be called from the serial section of the code.
... // other includes
#include <likwid-marker.h>
...
LIKWID_MARKER_INIT; // macro call to setup measurement system
...
LIKWID_MARKER_REGISTER("myregion"); // recommended to reduce overhead
// if OpenMP is used then this should be called in a parallel region
// same is true for stop/start calls
LIKWID_MARKER_START("myregion");
... // code region of interest
LIKWID_MARKER_STOP("myregion");
LIKWID_MARKER_CLOSE;
Note that, to allow for conditional compilation so that the original uninstrumented code can be recompiled without -DLIKWID_PERFMON, the marker macros can be defined as empty macros behind an #ifndef LIKWID_PERFMON preprocessor guard.
For an example and more detailed discussion please refer to the
Marker API tutorial.
For each macro there exists a corresponding function which you can call instead.
Additionally, similar APIs are also defined for NVIDIA and AMD GPUs.
Roofline Analysis using LIKWID
The Roofline model allows you to determine whether an application or kernel is memory bound or compute bound by measuring its performance (floating point operations per second) and operational intensity (floating point operations per byte of memory traffic). This makes it possible to assess the optimisation potential of the selected application, function or loop. The memory roofline is typically defined by measuring the attainable performance of a memory-bound code such as a version of the STREAM benchmark (this appears as a slanted roof in the Roofline plot). For the compute-bound ceiling, a theoretical maximum, a dense matrix multiplication kernel or another appropriate compute-bound microbenchmark can be used. Multiple rooflines can be defined: for example, main memory versus last-level cache in a cache-aware Roofline model for the slanted rooflines, and single versus double precision, with or without vector instructions such as SSE or AVX, as additional horizontal ceilings. Roofline analysis usually focuses on node-level performance optimisation.
The performance of an entire application or of an individual loop can then be plotted under those rooflines and interpreted as follows. If the point lies to the left of the vertical line through the ridge point, where the memory and compute rooflines meet, the code is memory bound; otherwise it is compute bound. Compute-bound kernels usually exhibit an operational intensity substantially higher than 1. The distance to the roofline represents the scope for optimisation: if the performance of a loop is already close to the attainable maximum it may be difficult to improve it further. The Roofline model indicates whether something may be worth optimising, but it does not suggest a specific optimisation.
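As a rough worked illustration (with hypothetical round numbers, not measured ARCHER2 values): if the compute ceiling were about 4000 GFLOP/s and the memory bandwidth about 350 GB/s, the ridge point would lie at 4000 / 350 ≈ 11.4 FLOP/byte. A kernel with a measured operational intensity of 0.25 FLOP/byte would then sit well to the left of the ridge point, with attainable performance bounded by 0.25 × 350 ≈ 88 GFLOP/s, i.e. it would be memory bound.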
As described above, the Marker API can be used to define regions of interest, such as a hot loop where most compute time is spent, identified via an initial profiling run. Additionally, likwid-bench can be used to obtain empirical peak memory and compute bounds using a version of the STREAM load benchmark and a peakflops kernel. These bounds can be obtained by running the appropriate microbenchmarks, for example by adding the lines below to the submission script (assuming a single-node run):
likwid-bench -t peakflops_avx_fma -W N:4GB:128:1:2 &> peak_flops.out
likwid-bench -t load_avx -W N:4GB:128:1:2 &> peak_bw.out
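For the application side of the Roofline plot, performance groups that combine floating point and memory traffic counters can be used to estimate operational intensity; on the Zen2 architecture a group such as MEM_DP may be suitable if available (check the output of likwid-perfctr -a and the Zen2 wiki page linked above for the groups provided on ARCHER2).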