ARCHER2 Known Issues
This section highlights known issues on ARCHER2, their potential impacts and any known workarounds. Many of these issues are under active investigation by HPE Cray and the wider service.
Singularity and CMake
The issue concerns the building of a cmake-compiled code in bind mode in Singularity containers. This fails because CMake 3.x, running within the container on a UAN, cannot find the MPI libraries (cray-mpich) on the host despite the correct paths having been provided.
There are several outstanding issues for the centrally installed Research Software:
- ChemShell and PyChemShell are not yet available. We are working with the code developers to address this.
- Climate Data Operators (CDO) and NCAR Command Language (NCL) are not yet available, due to issues with dependencies. An investigation is on-going.
- Paraview is not yet available. We hope to provide a suitable installation in the near future.
- VMD is not yet available. We hope to provide a suitable installation in the near future.
- PLUMED is not yet available. Currently, we recommend affected users to install a local version of the software.
Users should also check individual software pages, for known limitations/ caveats, for the use of software on the Cray EX platform and Cray Linux Environment.
stat-view not working
stat-view utility from the
cray-stat module does not currently
work due to missing dependencies within the HPE Cray software stack. If you
try to use the tool, you will see errors similar too:
auser@uan01:~> module load cray-stat auser@uan01:~> stat-view Traceback (most recent call last): File "/opt/cray/pe/stat/4.7.1/lib/python3.6/site-packages/STATview.py", line 60, in <module> import xdot ModuleNotFoundError: No module named 'xdot' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/cray/pe/stat/4.7.1/lib/python3.6/site-packages/STATmain.py", line 73, in <module> raise import_exception File "/opt/cray/pe/stat/4.7.1/lib/python3.6/site-packages/STATmain.py", line 40, in <module> from STATGUI import STATGUI_main File "/opt/cray/pe/stat/4.7.1/lib/python3.6/site-packages/STATGUI.py", line 37, in <module> import STATview File "/opt/cray/pe/stat/4.7.1/lib/python3.6/site-packages/STATview.py", line 62, in <module> raise Exception('STATview requires xdot\nxdot can be downloaded from https://github.com/jrfonseca/xdot.py\n') Exception: STATview requires xdot xdot can be downloaded from https://github.com/jrfonseca/xdot.py
The current workaround is to use
Issues with RPATH for non-default library versions
When you compile applications against non-default versions of libraries within the HPE
Cray software stack and use the environment variable
CRAY_ADD_RPATH=yes to try and encode
the paths to these libraries within the binary this will not be respected at runtime and
the binaries will use the default versions instead.
Memory leak leads to job fail by out of memory (OOM) error
Your program compiles and seems to run fine, but after some time (at least 10 minutes), it crashes with an out-of-memory (OOM) error. The job crashes more quickly when run on a smaller number of nodes.
Known workaround This may be caused by a known problem with the default version of MPICH, which is in the process of being resolved. In the meantime, you can get your jobs to run by switching to UCX MPICH instead. In your Slurm submission script, you will need to remove the following line:
module load epcc-job-env
Instead, you will need to load the programming environment used to compile your program, and switch from the default (OFI) version of MPICH to the UCX version of MPICH. So, if your program was compiled using the GNU programming environment, you should include the following lines in your Slurm submission script before loading any other module:
module restore /etc/cray-pe.d/PrgEnv-gnu module switch cray-mpich/8.0.16 cray-mpich-ucx/8.0.16 module switch craype-network-ofi craype-network-ucx
If your program was compiled in a different programming environment, make sure to change the first line to load the correct environment.
UCX ERROR: ivb_reg_mr
If you are using the UCX layer for MPI communication you may see an error such as:
[1613401128.440695] [nid001192:11838:0] ib_md.c:325 UCX ERROR ibv_reg_mr(address=0xabcf12c0, length=26400, access=0xf) failed: Cannot allocate memory [1613401128.440768] [nid001192:11838:0] ucp_mm.c:137 UCX ERROR failed to register address 0xabcf12c0 mem_type bit 0x1 length 26400 on md=mlx5_0: Input/output error (md reg_mem_types 0x15) [1613401128.440773] [nid001192:11838:0] ucp_request.c:269 UCX ERROR failed to register user buffer datatype 0x8 address 0xabcf12c0 len 26400: Input/output error MPICH ERROR [Rank 1534] [job id 114930.0] [Mon Feb 15 14:58:48 2021] [unknown] [nid001192] - Abort(672797967) (rank 1534 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack: PMPI_Isend(160)......: MPI_Isend(buf=0xabcf12c0, count=3300, MPI_DOUBLE_PRECISION, dest=1612, tag=4, comm=0x84000004, request=0x7fffb38fa0fc) failed MPID_Isend(416)......: MPID_isend_unsafe(92): MPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error) aborting job: Fatal error in PMPI_Isend: Other MPI error, error stack: PMPI_Isend(160)......: MPI_Isend(buf=0xabcf12c0, count=3300, MPI_DOUBLE_PRECISION, dest=1612, tag=4, comm=0x84000004, request=0x7fffb38fa0fc) failed MPID_Isend(416)......: MPID_isend_unsafe(92): MPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error) [1613401128.457254] [nid001192:11838:0] mm_xpmem.c:82 UCX WARN remote segment id 200002e09 apid 200002e3e is not released, refcount 1 [1613401128.457261] [nid001192:11838:0] mm_xpmem.c:82 UCX WARN remote segment id 200002e08 apid 100002e3e is not released, refcount 1
You can add the following line to your job submission script before the
to try and workaround this error:
Setting this flag may have an impact on code performance.
Recently Resolved Issues
Job failures with
If you see failures with an error message similar to:
MPICH ERROR [Rank 7] [job id 65233.0] [Mon Jan 11 16:15:34 2021] [unknown] [nid001066] - Abort(1115279) (rank 7 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack: MPIR_Init_thread(632)......: MPID_Init(516).............: MPIDI_NM_mpi_init_hook(575): **ofid_getinfo ofi_init.h 575 MPIDI_NM_mpi_init_hook No data available ...
this indicates an problem with the node. Please contact the ARCHER2 Service Desk with the job ID for your failed job.