Debugging
The following debugging tools are available on ARCHER2:
- Linaro Forge (DDT) is an easy-to-use graphical interface for source-level debugging of compiled C/C++ or Fortran codes. It can also be used for non-interactive debugging, and there is some limited support for Python debugging.
- gdb4hpc is a command-line debugging tool provided by HPE Cray. It works similarly to gdb, but allows the user to debug multiple parallel processes without multiple windows. gdb4hpc can be used to investigate deadlocked code, segfaults, and other errors for C/C++ and Fortran code. Users can single-step code and focus on specific process groups to help identify unexpected code behavior. (Text from ALCF.)
- valgrind4hpc is a parallel memory debugging tool that aids in detection of memory leaks and errors in parallel applications. It aggregates like errors across processes and threads to simplify debugging of parallel applications.
- STAT generates merged stack traces for parallel applications and provides visualisation tools.
- ATP provides scalable core file and backtrace analysis when parallel programs crash.
- CCDB, the Cray Comparative Debugger, compares two versions of code side-by-side to analyse differences. (Not currently described in this documentation.)
Linaro Forge
The Linaro Forge tool provides the DDT parallel debugger. See the Linaro Forge page of this documentation for details.
gdb4hpc
The GNU Debugger for HPC (gdb4hpc) is a GDB-based debugger used to debug applications compiled with CCE, PGI, GNU, and Intel Fortran, C and C++ compilers. It allows programmers to either launch an application within it or to attach to an already-running application. Attaching to an already-running and hanging application is a quick way of understanding why the application is hanging, whereas launching an application through gdb4hpc will allow you to see your application running step-by-step, output the values of variables, and check whether the application runs as expected.
Tip
For your executable to be compatible with gdb4hpc, it will need to be coded with MPI. You will also need to compile your code with the debugging flag -g (e.g. cc -g my_program.c -o my_exe).
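As an illustration only, the kind of code meant here is a small MPI program such as the hypothetical my_program.c sketched below (the file name matches the compile example above but is otherwise a placeholder). Compiled with cc -g, it gives gdb4hpc the source-line information it needs.
/* my_program.c -- hypothetical minimal MPI code used only to illustrate
   the compile example above. Build with: cc -g my_program.c -o my_exe   */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    /* gdb4hpc's initial breakpoint is set where MPI is initialised */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}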
Launching through gdb4hpc
Launch gdb4hpc:
module load gdb4hpc
gdb4hpc
You will get some information about this version of the program and, eventually, you will get a command prompt:
gdb4hpc 4.5 - Cray Line Mode Parallel Debugger
With Cray Comparative Debugging Technology.
Copyright 2007-2019 Cray Inc. All Rights Reserved.
Copyright 1996-2016 University of Queensland. All Rights Reserved.
Type "help" for a list of commands.
Type "help <cmd>" for detailed help about a command.
dbg all>
We will use launch to begin a multi-process application within gdb4hpc. Consider that we want to test an application called my_exe, and that we want this to be launched across all 256 processes on two nodes. We would launch this in gdb4hpc by running:
dbg all> launch --launcher-args="--account=[budget code] --partition=standard --qos=standard --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --exclusive --export=ALL" $my_prog{256} ./my_exe
Make sure to replace the --account value with your budget code (e.g. if you are using budget t01, that part should look like --account=t01).
The default launcher is srun, and the --launcher-args="..." option allows you to set launcher flags for srun. The variable $my_prog is a dummy name for the program being launched and you could use whatever name you want for it -- this will be the name of the srun job that will be run. The number in the brackets, {256}, is the number of processes over which the program will be executed; it is 256 here, but you could use any number. You should try to run this on as few processors as possible -- the more you use, the longer it will take for gdb4hpc to load the program.
Once the program is launched, gdb4hpc will load up the program and begin to run it. You will get output to the screen that looks something like:
Starting application, please wait...
Creating MRNet communication network...
Waiting for debug servers to attach to MRNet communications network...
Timeout in 400 seconds. Please wait for the attach to complete.
Number of dbgsrvs connected: [0]; Timeout Counter: [1]
Number of dbgsrvs connected: [0]; Timeout Counter: [2]
Number of dbgsrvs connected: [0]; Timeout Counter: [3]
Number of dbgsrvs connected: [1]; Timeout Counter: [0]
Number of dbgsrvs connected: [1]; Timeout Counter: [1]
Number of dbgsrvs connected: [2]; Timeout Counter: [0]
Finalizing setup...
Launch complete.
my_prog{0..255}: Initial breakpoint, main at /PATH/TO/my_program.c:34
The line number at which the initial breakpoint is made (in the above example, line 34) corresponds to the line number at which MPI is initialised. You will not be able to see any parts of the code outside of the MPI region with gdb4hpc.
Once the code is loaded, you can use various commands to move through your code. The following lists and describes some of the most useful ones:
- help -- Lists all gdb4hpc commands. You can run help COMMAND_NAME to learn more about a specific command (e.g. help launch will tell you about the launch command).
- list -- Will show the current line of code and the 9 lines following. Repeated use of list will move you down the code in ten-line chunks.
- next -- Will jump to the next step in the program for each process and output which line of code each process is on. It will not enter subroutines. Note that there is no reverse-step in gdb4hpc.
- step -- Like next, but this will step into subroutines.
- up -- Go up one level in the program (e.g. from a subroutine back to main).
- print var -- Prints the value of variable var at this point in the code.
- watch var -- Like print, but will print whenever the variable changes value.
- quit -- Exits gdb4hpc.
Remember to exit the interactive session once you are done debugging.
Attaching with gdb4hpc
Attaching to a hanging job using gdb4hpc is a great way of seeing which state each processor is in. However, this does not produce the most visually appealing results. For more easily readable output, please take a look at the STAT tool.
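As a hedged illustration of the sort of hang you might attach to, the hypothetical deadlock.c below never completes: rank 0 blocks in an MPI_Recv for a message that is never sent, so every other rank then blocks in the MPI_Barrier. Attaching gdb4hpc would show the ranks split between those two locations.
/* deadlock.c -- hypothetical example of a hanging MPI program.
   Rank 0 waits for a message that is never sent, so it blocks in
   MPI_Recv forever; all other ranks then block in MPI_Barrier.    */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Bug: no rank ever posts a matching MPI_Send. */
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* ranks 1..N-1 hang here */

    if (rank == 0)
        printf("never reached: value = %d\n", value);

    MPI_Finalize();
    return 0;
}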
In your interactive session, launch your executable as a background task (by adding an & at the end of the command). For example, if you are running an executable called my_exe using 256 processes, you would run:
srun -n 256 --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --time=01:00:00 --export=ALL \
--account=[budget code] --partition=standard --qos=standard ./my_exe &
Make sure to replace the --account value with your budget code (e.g. if you are using budget t01, that part should look like --account=t01).
You will need to get the full job ID of the job you have just launched. To do this, run:
squeue -u $USER
and find the job ID associated with this interactive session -- this will be the one with the jobname bash. In this example:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1050 workq my_mpi_j jsindt R 0:16 1 nid000001
1051 workq bash jsindt R 0:12 1 nid000002
the appropriate job ID is 1051. Next, you will need to run sstat on this job ID:
sstat 1051
This will output a large amount of information about this specific job. We are looking for the first number of this output, which should look like JOB_ID.## -- the number after the job ID is the number of slurm tasks performed in this interactive session. For our example (where srun is the first slurm task performed), the number is 1051.0.
Launch gdb4hpc:
module load gdb4hpc
gdb4hpc
You will get some information about this version of the program and, eventually, you will get a command prompt:
gdb4hpc 4.5 - Cray Line Mode Parallel Debugger
With Cray Comparative Debugging Technology.
Copyright 2007-2019 Cray Inc. All Rights Reserved.
Copyright 1996-2016 University of Queensland. All Rights Reserved.
Type "help" for a list of commands.
Type "help <cmd>" for detailed help about a command.
dbg all>
We will be using the attach command to attach to our program that hangs. This is done by writing:
dbg all> attach $my_prog JOB_ID.##
where JOB_ID.## is the full job ID found using sstat (in our example, this would be 1051.0). The name $my_prog is a dummy name -- it could be whatever name you like.
As it is attaching, gdb4hpc will output text to screen that looks like:
Attaching to application, please wait...
Creating MRNet communication network...
Waiting for debug servers to attach to MRNet communications network...
Timeout in 400 seconds. Please wait for the attach to complete.
Number of dbgsrvs connected: [0]; Timeout Counter: [1]
...
Finalizing setup...
Attach complete.
Current rank location:
After this, you will get an output that, among other things, tells you which line of your code each process is on, and what each process is doing. This can be helpful to see where the hang-up is.
If you accidentally attached to the wrong job, you can detach by running:
dbg all> release $my_prog
and re-attach with the correct job ID. You will need to change your dummy name from $my_prog to something else.
When you are finished using gdb4hpc, simply run:
dbg all> quit
Do not forget to exit your interactive session.
valgrind4hpc
valgrind4hpc is a Valgrind-based debugging tool to aid in the detection of memory leaks and errors in parallel applications. Valgrind4hpc aggregates any duplicate messages across ranks to help provide an understandable picture of program behavior. Valgrind4hpc manages starting and redirecting output from many copies of Valgrind, as well as recombining and filtering Valgrind messages. If your program can be debugged with Valgrind, it can be debugged with valgrind4hpc.
The valgrind4hpc module enables the use of standard valgrind as well as the valgrind4hpc version more suitable to parallel programs.
Using Valgrind with serial programs
Load the valgrind4hpc module:
module load valgrind4hpc
Next, run your executable through valgrind:
valgrind --tool=memcheck --leak-check=yes my_executable
The log outputs to screen. The ERROR SUMMARY will tell you whether, and how many, memory errors there are in your program. Furthermore, if you compile your code using the -g debugging flag (e.g. gcc -g my_program.c -o my_executable), the log will point out the code lines where the errors occur.
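As a hedged illustration, the hypothetical leaky.c below contains exactly the kind of defect memcheck reports: the buffer allocated in make_message() is never freed. Compiled with gcc -g and run under valgrind as above, the ERROR SUMMARY would flag the lost block and point at the malloc line.
/* leaky.c -- hypothetical example of a memory leak for memcheck.
   The buffer allocated in make_message() is never freed, so
   --leak-check=yes reports it as definitely lost.                */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *make_message(void)
{
    char *buf = malloc(64);     /* this allocation is never freed */
    strcpy(buf, "hello from a leaky program");
    return buf;
}

int main(void)
{
    char *message = make_message();
    printf("%s\n", message);
    return 0;                   /* message is leaked here */
}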
Valgrind also includes a tool called Massif that can be used to give insight into the memory usage of your program. It takes regular snapshots and outputs this data into a single file, which can be visualised to show the total amount of memory used as a function of time. This shows when peaks and bottlenecks occur and allows you to identify which data structures in your code are responsible for the largest memory usage of your program.
Documentation explaining how to use Massif is available in the official Massif manual. In short, you should run your executable as follows:
valgrind --tool=massif my_executable
The memory profiling data will be output into a file called massif.out.pid, where pid is the runtime process ID of your program. A custom filename can be chosen using the --massif-out-file option, as follows:
valgrind --tool=massif --massif-out-file=optional_filename.out my_executable
The output file contains raw profiling statistics. To view a summary, including a graphical plot of memory usage over time, use the ms_print command as follows:
ms_print massif.out.12345
or, to save to a file:
ms_print massif.out.12345 > massif.analysis.12345
This will show total memory usage over time as well as a breakdown of the top data structures contributing to memory usage at each snapshot where there has been a significant allocation or deallocation of memory.
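For illustration, the hypothetical massif_demo.c below has a heap profile with an obvious peak: usage climbs while blocks are allocated and falls again as they are freed. Run under valgrind --tool=massif, the ms_print output would attribute the peak to the marked malloc call.
/* massif_demo.c -- hypothetical program with a clear heap-usage peak.
   Memory use climbs to roughly 100 MiB while the blocks are allocated
   and drops back towards zero once they are freed; Massif's snapshots
   record this rise and fall.                                          */
#include <stdlib.h>

#define NBLOCKS   100
#define BLOCKSIZE (1024 * 1024)   /* 1 MiB per block */

int main(void)
{
    char *blocks[NBLOCKS];

    /* Phase 1: heap usage grows with each allocation. */
    for (int i = 0; i < NBLOCKS; i++) {
        blocks[i] = malloc(BLOCKSIZE);   /* the peak is attributed here */
        blocks[i][0] = (char)i;          /* touch the block */
    }

    /* Phase 2: heap usage falls back as blocks are released. */
    for (int i = 0; i < NBLOCKS; i++)
        free(blocks[i]);

    return 0;
}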
Using Valgrind4hpc with parallel programs
First, load valgrind4hpc:
module load valgrind4hpc
To run valgrind4hpc, first reserve the resources you will use with salloc. The following reservation request is for 2 nodes (256 physical cores) for 20 minutes on the short queue:
auser@uan01:> salloc --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 \
--time=00:20:00 --partition=standard --qos=short \
--hint=nomultithread \
--distribution=block:block --account=[budget code]
Once your allocation is ready, use valgrind4hpc to run and profile your executable. To test an executable called my_executable that requires two arguments arg1 and arg2 on 2 nodes and 256 processes, run:
valgrind4hpc --tool=memcheck --num-ranks=256 my_executable -- arg1 arg2
In particular, note the -- separating the executable from its arguments (this is not necessary if your executable takes no arguments).
Valgrind4hpc only supports certain tools found in valgrind: memcheck, helgrind, exp-sgcheck, and drd. The --valgrind-args="arguments" option allows users to pass valgrind options not supported in valgrind4hpc (e.g. --leak-check) -- note, however, that some of these options might interfere with valgrind4hpc.
More information on valgrind4hpc can be found in the manual (man valgrind4hpc).
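As a hedged sketch of the parallel case, the hypothetical mpi_leak.c below leaks one buffer per rank from the same source line. Because valgrind4hpc aggregates duplicate messages across ranks, errors from this line would be expected to appear once, aggregated, rather than once per rank (leak checking itself may need to be requested via --valgrind-args, as noted above).
/* mpi_leak.c -- hypothetical MPI program with a per-rank memory leak.
   Every rank allocates a work buffer at the same source line and never
   frees it, so valgrind4hpc can aggregate the identical reports.       */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Bug: this buffer is never freed on any rank. */
    double *work = malloc(1000 * sizeof(double));
    work[0] = (double)rank;

    printf("rank %d initialised its work buffer to %f\n", rank, work[0]);

    MPI_Finalize();
    return 0;
}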
STAT
The Stack Trace Analysis Tool (STAT) is a cross-platform debugging tool from the University of Wisconsin-Madison. ATP is based on the same technology as STAT; both are designed to gather and merge stack traces from a running application's parallel processes. The STAT tool can be useful when an application seems to be deadlocked or stuck, i.e. it does not crash but does not progress as expected, and it has been designed to scale to a very large number of processes. Full information on STAT, including use cases, is available at the STAT website.
STAT will attach to a running program and query that program to find out where all the processes in that program currently are. It will then process that data and produce a graph displaying the unique process locations (i.e. where all the processes in the running program currently are). To make this easily understandable, it collates together all processes that are in the same place, providing only unique program locations for display.
Using STAT on ARCHER2
On the login node, load the cray-stat module:
module load cray-stat
Then, launch your job using srun as a background task (by adding an & at the end of the command). For example, if you are running an executable called my_exe using 256 processes, you would run:
srun -n 256 --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --time=01:00:00 --export=ALL \
--account=[budget code] --partition=standard --qos=standard ./my_exe &
Note
This example has set the job time limit to 1 hour -- if you need longer, change the --time option.
You will need the process ID (PID) of the job you have just launched -- the PID is printed to screen upon launch, or you can get it by running:
ps -u $USER
This will present you with a set of text that looks like this:
PID TTY TIME CMD
154296 ? 00:00:00 systemd
154297 ? 00:00:00 (sd-pam)
154302 ? 00:00:00 sshd
154303 pts/8 00:00:00 bash
157150 pts/8 00:00:00 salloc
157152 pts/8 00:00:00 bash
157183 pts/8 00:00:00 srun
157185 pts/8 00:00:00 srun
157191 pts/8 00:00:00 ps
Once your application has reached the point where it hangs, issue the following command (replacing PID with the ID of the first srun task -- in the above example, I would replace PID with 157183):
stat-cl -i PID
You will get an output that looks like this:
STAT started at 2020-07-22-13:31:35
Attaching to job launcher (null):157565 and launching tool daemons...
Tool daemons launched and connected!
Attaching to application...
Attached!
Application already paused... ignoring request to pause
Sampling traces...
Traces sampled!
Resuming the application...
Resumed!
Pausing the application...
Paused!
...
Detaching from application...
Detached!
Results written to $PATH_TO_RUN_DIRECTORY/stat_results/my_exe.0000
Once STAT is finished, you can kill the srun job using scancel (replacing JID with the job ID of the job you just launched):
scancel JID
You can view the results that STAT has produced using the following command (note that "my_exe" will need to be replaced with the name of the executable you ran):
stat-view stat_results/my_exe.0000/00_my_exe.0000.3D.dot
This produces a graph displaying all the different places within the program that the parallel processes were when you queried them.
Note
To see the graph, you will need to have exported your X display when logging in.
Larger jobs may spend significant time queueing, requiring submission as a batch job. In this case, a slightly different invocation is illustrated as follows:
#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00
# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard
# Load additional modules
module load cray-stat
export OMP_NUM_THREADS=1
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
# This environment variable is required
export CTI_SLURM_OVERRIDE_MC=1
# Request that STAT sleeps for 3600 seconds before attaching
# to our executable, which we launch with the command introduced
# by -C:
stat-cl -s 3600 -C srun --unbuffered ./my_exe
If the job is hanging, it will continue to run until the wall clock exceeds the requested time. Use the stat-view utility to inspect the results, as discussed above.
ATP
To enable ATP you should load the atp module and set the ATP_ENABLED
environment variable to 1 on the login node:
module load atp
export ATP_ENABLED=1
# Fix for a known issue:
export HOME=${HOME/home/work}
Then, launch your job using srun as a background task (by adding an & at the end of the command). For example, if you are running an executable called my_exe using 256 processes, you would run:
srun -n 256 --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --time=01:00:00 --export=ALL \
--account=[budget code] --partition=standard --qos=standard ./my_exe &
Note
This example has set the job time limit to 1 hour -- if you need longer, change the --time option.
Once the job has finished running, load the cray-stat module to view the results:
module load cray-stat
and view the merged stack trace using:
stat-view atpMergedBT.dot
Note
To see the graph, you will need to have exported your X display when logging in.
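The kind of failure ATP is designed for can be illustrated with the hypothetical crash.c below: rank 0 writes through a null pointer shortly after start-up, raising SIGSEGV. With the atp module loaded and ATP_ENABLED=1, such a crash would leave an atpMergedBT.dot file recording where each rank was when the signal arrived.
/* crash.c -- hypothetical example of a crashing MPI program for ATP.
   Rank 0 writes through a null pointer and raises SIGSEGV; ATP then
   collects backtraces from all ranks into atpMergedBT.dot.           */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    int *bad_pointer = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        *bad_pointer = 42;   /* bug: dereference of a null pointer */

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}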