# Running jobs on ARCHER2¶

Warning

The ARCHER2 Service is not yet available. This documentation is in development.

As with most HPC services, ARCHER2 uses a scheduler to manage access to resources and ensure that the thousands of different users of the system are able to share it and all get access to the resources they require. ARCHER2 uses the Slurm software to schedule jobs.

Writing a submission script is typically the most convenient way to submit your job to the scheduler. Example submission scripts (with explanations) for the most common job types are provided below.
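To give a flavour of what such a script looks like, here is a minimal sketch for a single-node MPI job; the job name, account code, partition, QoS and executable name are placeholders that you should replace with values appropriate to your own project:

```
#!/bin/bash

#SBATCH --job-name=example_job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --time=00:20:00

# Placeholder project code, partition and QoS: replace with your own
#SBATCH --account=t01
#SBATCH --partition=standard
#SBATCH --qos=standard

# Launch the executable (placeholder name) across the allocated cores
srun ./my_program.x
```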

Interactive jobs are also available and can be particularly useful for developing and debugging applications. More details are available below.

Hint

If you have any questions on how to run jobs on ARCHER2 do not hesitate to contact the ARCHER2 Service Desk.

You typically interact with Slurm by issuing Slurm commands from the login nodes (to submit, check and cancel jobs), and by specifying Slurm directives that describe the resources required for your jobs in job submission scripts.

## Basic Slurm commands¶

There are four key commands used to interact with Slurm on the command line:

• sinfo - Get information on the partitions and resources available
• sbatch jobscript.slurm - Submit a job submission script (in this case called: jobscript.slurm) to the scheduler
• squeue - Get the current status of jobs submitted to the scheduler
• scancel 12345 - Cancel a job (in this case with the job ID 12345)
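For example, to list only your own jobs rather than the whole queue, squeue can be filtered by user name:

```
squeue -u $USER
```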

We cover each of these commands in more detail below.

### sinfo: information on resources¶

sinfo is used to query information about available resources and partitions. Without any options, sinfo lists the status of all resources and partitions, e.g.

```
sinfo
```

```
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST
standard           up 2-00:00:00      1  fail* cn580
```

### Submitting a job array¶

Job arrays are submitted using sbatch in the same way as for standard jobs:

```
sbatch job_script.slurm
```
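The array itself is defined with the --array Slurm directive inside the script. As an illustrative sketch (resource values, input naming scheme and executable name are placeholders), each array element can use the SLURM_ARRAY_TASK_ID environment variable to select its own input:

```
#!/bin/bash

#SBATCH --job-name=example_array
#SBATCH --nodes=1
#SBATCH --time=00:20:00
# Run 16 copies of the job, numbered 0 to 15
#SBATCH --array=0-15

# Each array element processes its own input file
srun ./my_program.x input_${SLURM_ARRAY_TASK_ID}.dat
```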

## Job chaining¶

Job dependencies can be used to construct complex pipelines or chain together long simulations requiring multiple steps.

Note

The --parsable option to sbatch can simplify working with job dependencies. It returns the job ID in a format that can be used as the input to other commands.

For example:

```
jobid=$(sbatch --parsable first_job.sh)
sbatch --dependency=afterok:$jobid second_job.sh
```

or for a longer chain:
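```
# third_job.sh is a further (hypothetical) script in the chain
jobid1=$(sbatch --parsable first_job.sh)
jobid2=$(sbatch --parsable --dependency=afterok:$jobid1 second_job.sh)
jobid3=$(sbatch --parsable --dependency=afterok:$jobid2 third_job.sh)
```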

### Long Running Jobs¶

Simulations which must run for a long period of time achieve the best throughput when composed of many small jobs chained together using a checkpoint and restart method (see above for how to chain jobs together). However, this method does incur a startup and shutdown overhead for each job as the state is saved and loaded, so you should experiment to find the best balance between runtime (longer runtimes minimise the checkpoint/restart overheads) and throughput (shorter runtimes maximise throughput).
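As a sketch of how such a chain could be submitted (checkpoint_job.sh is a hypothetical script that writes a checkpoint on exit and restarts from the most recent one), a loop of dependent jobs avoids repeating each sbatch call by hand:

```
# Submit a chain of 5 identical checkpoint/restart jobs, each one
# starting only after the previous job has completed successfully
jobid=$(sbatch --parsable checkpoint_job.sh)
for i in $(seq 2 5); do
    jobid=$(sbatch --parsable --dependency=afterok:$jobid checkpoint_job.sh)
done
```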

### Large Jobs¶

Large jobs may take longer to start up. The sbcast command is recommended for large jobs requesting over 1500 MPI tasks. By default, Slurm reads the executable on the allocated compute nodes from the location where it is installed; this may take a long time when the file system (where the executable resides) is slow or busy. With the sbcast command, the executable can be copied to the /tmp directory on each of the compute nodes. Since /tmp is part of the memory on the compute nodes, this can speed up the job startup time.

```
sbcast --compress=lz4 /path/to/exe /tmp/exe
srun /tmp/exe
```

### Network Locality¶

For jobs which are sensitive to interconnect (MPI) performance and use 256 nodes or fewer, it is possible to request that all nodes are in a single Slingshot dragonfly group.

Slurm has a concept of "switches" which on ARCHER2 are configured to map to Slingshot groups (there are 256 nodes per group). Since this places an additional constraint on the scheduler, a maximum time to wait for the requested topology can be specified. For example, to ask for all nodes to be in a single group, waiting for at most 60 minutes:

```
sbatch --switches=1@60 job.sh
```

### Process Placement¶

Several mechanisms exist to control process placement on ARCHER2. Application performance can depend strongly on placement depending on the communication pattern and other computational characteristics.

#### Default¶

The default is to place MPI tasks sequentially on nodes until the maximum number of tasks is reached:

```
salloc: Granted job allocation 24236
salloc: Waiting for resource configuration
salloc: Nodes nid[000001-000008] are ready for job
```

```
srun --cpu-bind=cores xthi
```

```
Hello from rank 0, thread 0, on nid000001. (core affinity = 0,128)
Hello from rank 1, thread 0, on nid000001. (core affinity = 16,144)
Hello from rank 2, thread 0, on nid000002. (core affinity = 0,128)
Hello from rank 3, thread 0, on nid000002. (core affinity = 16,144)
Hello from rank 4, thread 0, on nid000003. (core affinity = 0,128)
Hello from rank 5, thread 0, on nid000003. (core affinity = 16,144)
Hello from rank 6, thread 0, on nid000004. (core affinity = 0,128)
Hello from rank 7, thread 0, on nid000004. (core affinity = 16,144)
Hello from rank 8, thread 0, on nid000005. (core affinity = 0,128)
Hello from rank 9, thread 0, on nid000005. (core affinity = 16,144)
Hello from rank 10, thread 0, on nid000006. (core affinity = 0,128)
Hello from rank 11, thread 0, on nid000006. (core affinity = 16,144)
Hello from rank 12, thread 0, on nid000007. (core affinity = 0,128)
Hello from rank 13, thread 0, on nid000007. (core affinity = 16,144)
Hello from rank 14, thread 0, on nid000008. (core affinity = 0,128)
Hello from rank 15, thread 0, on nid000008. (core affinity = 16,144)
```

#### MPICH_RANK_REORDER_METHOD¶

The MPICH_RANK_REORDER_METHOD environment variable is used to specify other types of MPI task placement. For example, setting it to 0 results in a round-robin placement:

```
salloc: Granted job allocation 24236
salloc: Waiting for resource configuration
salloc: Nodes nid[000001-000008] are ready for job
```

```
export MPICH_RANK_REORDER_METHOD=0
srun --cpu-bind=cores xthi
```

```
Hello from rank 0, thread 0, on nid000001. (core affinity = 0,128)
Hello from rank 1, thread 0, on nid000002. (core affinity = 0,128)
Hello from rank 2, thread 0, on nid000003. (core affinity = 0,128)
Hello from rank 3, thread 0, on nid000004. (core affinity = 0,128)
Hello from rank 4, thread 0, on nid000005. (core affinity = 0,128)
Hello from rank 5, thread 0, on nid000006. (core affinity = 0,128)
Hello from rank 6, thread 0, on nid000007. (core affinity = 0,128)
Hello from rank 7, thread 0, on nid000008. (core affinity = 0,128)
Hello from rank 8, thread 0, on nid000001. (core affinity = 16,144)
Hello from rank 9, thread 0, on nid000002. (core affinity = 16,144)
Hello from rank 10, thread 0, on nid000003. (core affinity = 16,144)
Hello from rank 11, thread 0, on nid000004. (core affinity = 16,144)
Hello from rank 12, thread 0, on nid000005. (core affinity = 16,144)
Hello from rank 13, thread 0, on nid000006. (core affinity = 16,144)
Hello from rank 14, thread 0, on nid000007. (core affinity = 16,144)
Hello from rank 15, thread 0, on nid000008. (core affinity = 16,144)
```

There are other modes available with the MPICH_RANK_REORDER_METHOD environment variable, including one which lets the user provide a file called MPICH_RANK_ORDER which contains a list of each task’s placement on each node. These options are described in detail in the intro_mpi man page.

#### grid_order¶

For MPI applications which perform a large amount of nearest-neighbor communication, e.g., stencil-based applications on structured grids, Cray provides a tool in the perftools-base module called grid_order which can generate an MPICH_RANK_ORDER file automatically, taking as parameters the dimensions of the grid, the core count, etc. For example, to place MPI tasks in row-major order on a Cartesian grid of size (4, 4, 4), using 32 tasks per node:

```
grid_order -R -c 32 -g 4,4,4
```

```
# grid_order -R -Z -c 32 -g 4,4,4
# Region 3: 0,0,1 (0..63)
0,1,2,3,16,17,18,19,32,33,34,35,48,49,50,51,4,5,6,7,20,21,22,23,36,37,38,39,52,53,54,55
8,9,10,11,24,25,26,27,40,41,42,43,56,57,58,59,12,13,14,15,28,29,30,31,44,45,46,47,60,61,62,63
```

One can then save this output to a file called MPICH_RANK_ORDER and set MPICH_RANK_REORDER_METHOD=3 before running the job, which tells Cray MPI to read the MPICH_RANK_ORDER file to set the MPI task placement. For more information, please see the man page man grid_order (available when the perftools-base module is loaded).
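Putting this together, a sketch of the complete workflow (grid dimensions, task count and executable name are placeholders):

```
# Generate a rank order file for a 4x4x4 grid with 32 tasks per node
grid_order -R -c 32 -g 4,4,4 > MPICH_RANK_ORDER

# Tell Cray MPI to read MPICH_RANK_ORDER for task placement
export MPICH_RANK_REORDER_METHOD=3
srun ./my_program.x
```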

### Huge pages¶

Huge pages are virtual memory pages which are bigger than the default page size of 4 KB. Huge pages can improve memory performance for common access patterns on large data sets since they help to reduce the number of virtual-to-physical address translations compared with the default 4 KB pages.
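On ARCHER2, huge pages are made available by loading one of the craype-hugepages modules. As a quick check (the exact set of page sizes provided may vary), you can list the available modules with:

```
module avail craype-hugepages
```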

To use huge pages for an application (with the 2 MB huge pages as an example), load the corresponding huge pages module before compiling:

```
module load craype-hugepages2M
cc -o mycode.exe mycode.c
```

The same huge pages module must also be loaded at runtime.
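For instance, a sketch of the relevant part of a job submission script (the executable name is a placeholder):

```
# Load the same huge pages module that was used at compile time
module load craype-hugepages2M

srun ./mycode.exe
```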

Warning

Due to memory fragmentation, applications may get Cannot allocate memory warnings or errors when there are not enough huge pages available on the compute node, such as:

```
libhugetlbfs [nid0000xx:xxxxx]: WARNING: New heap segment map at 0x10000000 failed: Cannot allocate memory
```

By default, the libhugetlbfs verbosity level (the HUGETLB_VERBOSE environment variable) is set to 0 on ARCHER2 to suppress debugging messages. Users can increase this value to obtain more information on huge pages use.
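For example, to make libhugetlbfs print more diagnostic output (the level shown is illustrative; higher values are more verbose):

```
export HUGETLB_VERBOSE=2
```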

#### When to Use Huge Pages¶

• For MPI applications, map the static data and/or heap onto huge pages.
• For an application which uses shared memory that needs to be concurrently registered with the high speed network drivers for remote communication.
• For SHMEM applications, map the static data and/or private heap onto huge pages.
• For applications written in Unified Parallel C, Coarray Fortran, and other languages based on the PGAS programming model, map the static data and/or private heap onto huge pages.
• For an application doing heavy I/O.
• To improve memory performance for common access patterns on large data sets.

#### When to Avoid Huge Pages¶

• Applications sometimes consist of many steering programs in addition to the core application. Applying huge page behavior to all processes would not provide any benefit and would consume huge pages that would otherwise benefit the core application. The runtime environment variable HUGETLB_RESTRICT_EXE can be used to specify the subset of the programs that should use huge pages (see the sketch after this list).
• For certain applications, using huge pages can cause issues or slow down performance. One such example is when an application spawns subprocesses or threads (such as pthreads) which allocate memory; this newly allocated memory uses the default 4 KB pages.
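A minimal sketch of restricting huge pages to the core application, assuming (hypothetically) that its binary is named mycode.exe; see the libhugetlbfs documentation for the exact value format:

```
# Only processes whose executable matches this name use huge pages
export HUGETLB_RESTRICT_EXE=mycode.exe
```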