ARCHER2 Known Issues
This section highlights known issues on ARCHER2, their potential impacts and any known workarounds. Many of these issues are under active investigation by HPE Cray and the wider service.
This page was last reviewed on 25 Jan 2023
Incorrect answers using LibSci threaded dgemm (Added 2023-01-25)
If the first matrix argument of dgemm is transposed with 'T' then the threaded version can give incorrect results using the cpe/21.04 module (which is the default as of 25th January 2023). This issue is fixed in cpe/21.09. Note that activating this newer environment requires an explicit change to LD_LIBRARY_PATH in addition to loading the newer cpe module. For details see Using non-default versions of HPE Cray libraries on ARCHER2.
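As a sketch of the switch (assuming the LD_LIBRARY_PATH approach described for non-default library versions later on this page), a job script would contain something like:

# Assumed workaround: load the newer cpe module and pick up its libraries at runtime
module load cpe/21.09
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH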
--cpu-freq=X option is not respected when used with sbatch (Added: 2023-01-18)
If you specify the CPU frequency using the --cpu-freq option with the sbatch command (either via an #SBATCH directive in the job script or via the --cpu-freq=X option on the command line), the option will not be respected: the default setting for ARCHER2 (2.0 GHz) will override it. You should instead specify the --cpu-freq option to srun within the job submission script, e.g.:
srun --cpu-freq=2250000 ...
You can find more information on setting the CPU frequency in the User Guide.
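For illustration, a minimal job script might look like the following (the partition, QoS and application name below are placeholders, not prescriptions from this page):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --partition=standard
#SBATCH --qos=standard

# Set the frequency on srun itself; a --cpu-freq passed to sbatch would be
# overridden by the ARCHER2 default (2.0 GHz)
srun --cpu-freq=2250000 ./my_app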
OOM due to memory leak in libfabric (Added: 2022-02-23)
There is an underlying memory leak in the version of libfabric on ARCHER2 (which comes as part of the underlying SLES operating system) that can cause jobs to fail with an OOM (Out Of Memory) error. This issue will be addressed in a future upgrade of the ARCHER2 operating system. You can work around it by setting the following environment variable in your job submission scripts:
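The variable itself is elided above; at the time, the usual mitigation was to disable the libfabric memory-registration cache, which (as an assumption to be checked against the current ARCHER2 documentation) is:

# Assumed setting: disable the libfabric memory-registration cache
export FI_MR_CACHE_MAX_COUNT=0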
This may come with a performance penalty (though in many cases, we have not seen a noticeable performance impact).
If you continue to see OOM errors after setting this environment variable, and you do not believe that your application should be requesting that much memory, then please contact the service desk.
Default FFTW library points to Intel Haswell version rather than AMD Rome at runtime (Added: 2022-02-23)
By default, and at runtime, the standard FFTW library version (from the cray-fftw module) will link to a version of the FFTW library optimised for the Intel Haswell architecture rather than the AMD EPYC architecture. This does not cause errors, as the instruction set is compatible between the two architectures, but it may not provide optimal performance. The performance differences observed have been small (< 5%), but if you want to ensure that applications using the module use the correct version of the libraries at runtime, you should add the following lines to your job submission script:
module load cray-fftw
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
The issue arises because the compilers encode a default library path (/opt/cray/pe/lib64) into the executable so that you do not need to load all the library module dependencies at runtime. The libraries that the executable finds at /opt/cray/pe/lib64 are soft links to the default versions of the libraries. There is an error in the soft links to the FFTW libraries in this directory such that they point to the Haswell version of the FFTW libraries rather than the AMD EPYC (Rome) versions.
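To confirm which FFTW library a binary will actually resolve at runtime, you can inspect it with ldd (my_app below is a placeholder for your executable):

# Check which FFTW shared library the dynamic linker resolves for a binary
module load cray-fftw
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
ldd ./my_app | grep fftw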
Occasionally user jobs can cause compute nodes to crash (Added: 2021-11-22)
In rare circumstances, it is possible for a user job to crash the compute nodes on which it is running. This is only evident to the user as a failed job: there is no obvious sign that the nodes have crashed. Therefore, if we identify a user whose jobs are causing nodes to crash, we may need to work with them to stop this happening.
The underlying issue is resolved in an update to the compute-node operating system (Shasta Version 1.5), which is expected to be rolled out to the main system early in 2022.
In the meantime, users who experience this issue are advised to try disabling XPMEM in their application. The ARCHER2 CSE team can provide advice on how to do this. You can contact the CSE team via the service desk.
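As an illustration only (this is an assumption, not advice from this page; confirm with the CSE team first), HPE Cray MPICH allows on-node single-copy transfers to be moved off XPMEM via an environment variable in the job script:

# Assumed approach: switch Cray MPICH's on-node single-copy transport from XPMEM to CMA
export MPICH_SMP_SINGLE_COPY_MODE=CMA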
Dask Python package missing dependencies (Added: 2021-11-22)
The Dask Python package is missing some dependencies on the latest Programming Environment (21.09). This can be worked around either by using the default Programming Environment (21.04), or by following the instructions to install dask in your own user space.
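A minimal sketch of a user-space install (assuming the cray-python module and pip's --user scheme; check the ARCHER2 Python documentation for the recommended steps):

# Assumed user-space install of dask with pip
module load cray-python
pip install --user dask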
Warning when compiling Fortran code with CCE and MPI_F08 interface (Added: 2021-11-18)
When you compile Fortran code that uses the MPI F08 interface (i.e. use mpi_f08) with the default version of CCE (11.0.4), you will see warnings similar to:
use mpi_f08
      ^
ftn-1753 crayftn: WARNING INTERFACE_MPI, File = interface_mpi_mod.f90, Line = 8, Column = 7
  File "/opt/cray/pe/mpich/8.1.4/ofi/cray/9.1/include/MPI_F08.mod" containing [sub]module information for "MPI_F08"
  was created with a previous compiler release. It will not be supported by the next major release.
  It is version 110 from release 9.0.
These warnings can be safely ignored as they do not affect the functioning of the code. If
you wish to avoid the warnings, you can compile using the more recent CCE version (12.0.3)
on the system. To switch to this version, use
module load cpe/21.09 from the default
environment on ARCHER2.
There are several outstanding issues for the centrally installed Research Software:
- PLUMED is not yet available. Currently, we recommend that affected users install a local version of the software.
Users should also check individual software pages for known limitations and caveats on the use of software on the Cray EX platform and Cray Linux Environment.
Issues with RPATH for non-default library versions
When you compile applications against non-default versions of libraries within the HPE Cray software stack and use the environment variable CRAY_ADD_RPATH=yes to try to encode the paths to these libraries within the binary, this will not be respected at runtime and the binaries will use the default versions instead.
The workaround for this issue is to ensure that you set:
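# Assumed setting (the original is elided on this page); follows the LD_LIBRARY_PATH
# approach used in the FFTW workaround above
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH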
at both compile and runtime. For more details on using non-default versions of libraries, see the description in the User and Best Practice Guide.
UCX ERROR: ibv_reg_mr
If you are using the UCX layer for MPI communication you may see an error such as:
[1613401128.440695] [nid001192:11838:0] ib_md.c:325 UCX ERROR ibv_reg_mr(address=0xabcf12c0, length=26400, access=0xf) failed: Cannot allocate memory
[1613401128.440768] [nid001192:11838:0] ucp_mm.c:137 UCX ERROR failed to register address 0xabcf12c0 mem_type bit 0x1 length 26400 on md=mlx5_0: Input/output error (md reg_mem_types 0x15)
[1613401128.440773] [nid001192:11838:0] ucp_request.c:269 UCX ERROR failed to register user buffer datatype 0x8 address 0xabcf12c0 len 26400: Input/output error
MPICH ERROR [Rank 1534] [job id 114930.0] [Mon Feb 15 14:58:48 2021] [unknown] [nid001192] - Abort(672797967) (rank 1534 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(160)......: MPI_Isend(buf=0xabcf12c0, count=3300, MPI_DOUBLE_PRECISION, dest=1612, tag=4, comm=0x84000004, request=0x7fffb38fa0fc) failed
MPID_Isend(416)......:
MPID_isend_unsafe(92):
MPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error)
aborting job:
Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(160)......: MPI_Isend(buf=0xabcf12c0, count=3300, MPI_DOUBLE_PRECISION, dest=1612, tag=4, comm=0x84000004, request=0x7fffb38fa0fc) failed
MPID_Isend(416)......:
MPID_isend_unsafe(92):
MPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error)
[1613401128.457254] [nid001192:11838:0] mm_xpmem.c:82 UCX WARN remote segment id 200002e09 apid 200002e3e is not released, refcount 1
[1613401128.457261] [nid001192:11838:0] mm_xpmem.c:82 UCX WARN remote segment id 200002e08 apid 100002e3e is not released, refcount 1
You can add the following line to your job submission script before the srun command to try to work around this error:
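# Assumed setting (the original is elided on this page): change how UCX registers
# memory regions with the InfiniBand driver
export UCX_IB_REG_METHODS=direct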
Setting this flag may have an impact on code performance.
AOCC compiler fails to compile with NetCDF (Added: 2021-11-18)
There is currently a problem with the module file, which means that cray-netcdf-hdf5parallel will not operate correctly in PrgEnv-aocc. An example of the error seen is:
F90-F-0004-Corrupt or Old Module file /opt/cray/pe/netcdf-hdf5parallel/<version>/crayclang/9.1/include/netcdf.mod (netcdf.F90: 8)
The current workaround for this is to load the epcc-netcdf-hdf5parallel module instead if PrgEnv-aocc is required.
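A sketch of the workaround in practice (assuming PrgEnv-aocc can be loaded directly; adjust to however you normally select the AOCC environment):

# Work around the broken cray-netcdf-hdf5parallel module under PrgEnv-aocc
module load PrgEnv-aocc
module load epcc-netcdf-hdf5parallel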
--export option does not work in job submission script
The --export=ALL option propagates all the environment variables from the login node to the compute nodes. If you include the option in the job submission script itself, it is wrongly ignored by Slurm. The current workaround is to include the option on the command line when the job submission script is launched. For instance:
sbatch --export=ALL myjob.slurm