Machine Learning

Two Machine Learning (ML) frameworks are supported on ARCHER2: PyTorch and TensorFlow.

For each framework, we'll show how to run a particular MLCommons HPC benchmark. We start with PyTorch.

PyTorch

On ARCHER2, PyTorch is supported for use on both the CPU and GPU nodes.

We'll demonstrate the use of PyTorch with DeepCam, a deep learning climate segmentation benchmark. It involves training a neural network to recognise large-scale weather phenomena (e.g., tropical cyclones, atmospheric rivers) in the output generated by ensembles of weather simulations. See the link below for more details.

Exascale Deep Learning for Climate Analytics

There are two DeepCam training datasets available on ARCHER2: a 62 GB mini dataset (/work/z19/shared/mlperf-hpc/deepcam/mini) and a much larger 8.9 TB dataset (/work/z19/shared/mlperf-hpc/deepcam/full).

DeepCam on GPU

As DeepCam is an MLPerf benchmark, you may wish to base a custom Python environment on pytorch/2.9.1-gpu so that you can install additional Python packages that support MLPerf logging, as well as extra features pertinent to DeepCam (e.g., dynamic learning rates).

The pytorch/2.9.1-gpu module requires Python 3.11.7, as that is the version of Python installed on the GPU nodes. On the login and CPU nodes, however, the latest version of Python is 3.10.10, and so the pytorch/2.9.1-gpu module cannot be loaded from those locations unless you are running within a container that has cray-python/3.11.7 and rocm/6.3.4. Fortunately, there exists a containerised ROCm 6.3.4 image that includes Python 3.11.7.

The commands below instantiate the ROCm container image (see the Containerised ROCm section for more details).

module use /work/y07/shared/archer2-lmod/others/dev
module load ccpe/25.09-rocm-6.3.4

singularity shell --cleanenv --bind ${HOME/home/work}/pyenvs ${CCPE_IMAGE_FILE}
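The ${HOME/home/work} expression, used here and throughout this page, is Bash pattern substitution: it replaces the first occurrence of home in the value of HOME with work, turning a /home path into its /work counterpart. As a rough illustration only, the same mapping can be sketched in Python. The function name and the example path are hypothetical and are not taken from the ARCHER2 documentation.

```python
def home_to_work(home: str) -> str:
    """Mimic Bash's ${HOME/home/work}: replace the first 'home' with 'work'."""
    return home.replace("home", "work", 1)


# Hypothetical home directory, for illustration only.
example_home = "/home/t01/t01/auser"
print(home_to_work(example_home))
```

Only the first match is replaced, so a later occurrence of "home" elsewhere in the path is left untouched.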

From within the container, we can download the necessary Python wheels that are compatible with Python 3.11. Note, we wouldn't be able to do this from a GPU node as that environment does not have access to the outside world.

The commands for running inside the container are listed below.

source /etc/bash.bashrc.local

module -q load craype-x86-milan
module -q load craype-accel-amd-gfx90a
module -q load rocm
module -q load PrgEnv-gnu
module -q load cray-python

PRFX=${HOME/home/work}/pyenvs
PYVENV_ROOT=${PRFX}/mlperf-pt-gpu

mkdir -p ${PYVENV_ROOT}
cd ${PYVENV_ROOT}

mkdir -p ${PYVENV_ROOT}/wheels-wheel
cd ${PYVENV_ROOT}/wheels-wheel
pip download --no-cache-dir wheel

mkdir -p ${PYVENV_ROOT}/wheels-deepcam
cd ${PYVENV_ROOT}/wheels-deepcam
pip download --no-cache-dir h5py mlperf-logging warmup-scheduler

exit

We've now returned to the login node. Next, we launch a job on a GPU node to complete the setup of the custom Python environment for DeepCam.

#!/bin/bash

#SBATCH --job-name=deepcam-pyenv
#SBATCH --account=[budget code]
#SBATCH --partition=gpu
#SBATCH --qos=gpu-shd
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --time=00:20:00
#SBATCH --exclusive


module -q load pytorch/2.9.1-gpu

PYTHON_TAG=python`echo ${CRAY_PYTHON_LEVEL} | cut -d. -f1-2`

PRFX=${HOME/home/work}/pyenvs
PYVENV_ROOT=${PRFX}/mlperf-pt-gpu

python -m venv --system-site-packages ${PYVENV_ROOT}

extend-venv-activate ${PYVENV_ROOT}

source ${PYVENV_ROOT}/bin/activate

export PIP_CACHE_DIR=${PYVENV_ROOT}/.cache/pip

cd ${PYVENV_ROOT}/wheels-wheel
python -m pip install --no-build-isolation *

cd ${PYVENV_ROOT}/wheels-deepcam
python -m pip install --no-build-isolation *

To prepare for running a DeepCam training job, we must clone the MLCommons HPC GitHub repo, which can be done from the login node.

mkdir -p ${HOME/home/work}/tests
cd ${HOME/home/work}/tests

git clone https://github.com/mlcommons/hpc.git mlperf-hpc

cd ./mlperf-hpc/deepcam/src/deepCam

You are now ready to run the following DeepCam submission script via the sbatch command.

#!/bin/bash

#SBATCH --job-name=deepcam
#SBATCH --account=[budget code]
#SBATCH --partition=gpu
#SBATCH --qos=gpu-exc
#SBATCH --nodes=2
#SBATCH --gpus=8
#SBATCH --time=01:00:00
#SBATCH --exclusive


JOB_OUTPUT_PATH=./results/${SLURM_JOB_ID}
mkdir -p ${JOB_OUTPUT_PATH}/logs

source ${HOME/home/work}/pyenvs/mlperf-pt-gpu/bin/activate

export OMP_NUM_THREADS=1
export HOME=${HOME/home/work}

srun --ntasks=8 --tasks-per-node=4 \
     --cpu-bind=verbose,map_cpu:0,8,16,24 --hint=nomultithread \
     python train.py \
         --run_tag test \
         --data_dir_prefix /work/z19/shared/mlperf-hpc/deepcam/mini \
         --output_dir ${JOB_OUTPUT_PATH} \
         --wireup_method nccl-slurm \
         --max_epochs 64 \
         --local_batch_size 1

mv slurm-${SLURM_JOB_ID}.out ${JOB_OUTPUT_PATH}/slurm.out

The job submission script activates the Python environment that was set up earlier, but that particular command (source ${HOME/home/work}/pyenvs/mlperf-pt-gpu/bin/activate) could be replaced by module -q load pytorch/2.9.1-gpu if you are not running DeepCam and have no need for additional Python packages such as mlperf-logging and warmup-scheduler.

In the script above, we specify four tasks per node, one for each GPU. These tasks are evenly spaced across the node so as to maximise the communications bandwidth between the host and the GPU devices. Note that PyTorch does not use Cray MPICH for inter-task communications; these are instead handled by the ROCm Collective Communications Library (RCCL), hence the --wireup_method nccl-slurm option (nccl-slurm works as an alias for rccl-slurm in this context).
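To make the task layout concrete, the sketch below works out where each of the eight tasks lands for the srun options used above (--ntasks=8, --tasks-per-node=4, map_cpu:0,8,16,24). It assumes that ranks are assigned to nodes in contiguous blocks and that the i-th entry of the map_cpu list applies to the i-th task on each node. The function name is hypothetical.

```python
def task_placement(ntasks, tasks_per_node, cpu_map):
    """Return (rank, node, local_rank, cpu) for each task.

    Assumes ranks are assigned to nodes in contiguous blocks and that the
    i-th entry of cpu_map is applied to the i-th task on each node.
    """
    placements = []
    for rank in range(ntasks):
        node, local_rank = divmod(rank, tasks_per_node)
        placements.append((rank, node, local_rank, cpu_map[local_rank]))
    return placements


# Values taken from the srun line in the submission script.
for rank, node, local_rank, cpu in task_placement(8, 4, [0, 8, 16, 24]):
    print(f"rank {rank}: node {node}, local rank {local_rank}, cpu {cpu}")
```

The verbose keyword in --cpu-bind makes srun report the binding it actually applied, which can be compared against this expected layout in the job output.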

The above job should converge, reaching an Intersection over Union (IoU) of 0.82, after 35 epochs or so. Runtime should be around 20-30 minutes.
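For readers unfamiliar with the metric, IoU for a single class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number labelled with it in either. The sketch below is a minimal pure-Python illustration on flattened label lists. It is not the DeepCam implementation, and the handling of a class that is absent from both masks (returning 1.0) is an assumption made for this example.

```python
def iou(pred, target, label):
    """Intersection over Union for one class label on flattened masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    union = sum(1 for p, t in zip(pred, target) if p == label or t == label)
    # Assumption: a class absent from both masks counts as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pred, target, labels):
    """Average the per-class IoU over the given labels."""
    return sum(iou(pred, target, label) for label in labels) / len(labels)


# Toy masks with three classes (0, 1, 2), for illustration only.
pred = [0, 1, 1, 2, 0, 1]
target = [0, 1, 2, 2, 0, 0]
print(mean_iou(pred, target, labels=[0, 1, 2]))
```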

We can also modify the DeepCam train.py script so that the accuracy and loss are logged using TensorBoard.

The following lines must be added to the DeepCam train.py script.

import os
...

from torch.utils.tensorboard import SummaryWriter

...

def main(pargs):

    #init distributed training
    comm_local_group = comm.init(pargs.wireup_method, pargs.batchnorm_group_size)
    comm_rank = comm.get_rank()
    ...

    #set up logging
    pargs.logging_frequency = max([pargs.logging_frequency, 0])
    log_file = os.path.normpath(os.path.join(pargs.output_dir, "logs", pargs.run_tag + ".log"))
    ...

    writer = SummaryWriter()

    #set seed
    ...

    ...

    #training loop
    while True:
        ...

        #training
        step = train_epoch(pargs, comm_rank, comm_size,
                           ...
                           logger, writer)

        ...

The train_epoch function is defined in ./driver/trainer.py and so that file must be amended like so.

...

def train_epoch(pargs, comm_rank, comm_size,
                ...,
                logger, writer):

    ...

    writer.add_scalar("Accuracy/train", iou_avg_train, epoch+1)
    writer.add_scalar("Loss/train", loss_avg_train, epoch+1)

    return step
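Because every task runs main, each task would create its own SummaryWriter with the changes above. One possible refinement, offered here as a suggestion rather than as part of the DeepCam source, is to write metrics from rank 0 only. The sketch below shows the pattern using a stand-in writer so that it runs without PyTorch; RecordingWriter and log_epoch_metrics are hypothetical names, and the stand-in only mirrors the add_scalar(tag, scalar_value, global_step) call shape used above.

```python
class RecordingWriter:
    """Stand-in that mirrors the add_scalar call shape of SummaryWriter."""

    def __init__(self):
        self.records = []

    def add_scalar(self, tag, scalar_value, global_step=None):
        self.records.append((tag, scalar_value, global_step))


def log_epoch_metrics(writer, comm_rank, epoch, iou_avg_train, loss_avg_train):
    """Write per-epoch training metrics, but only from rank 0."""
    if comm_rank != 0 or writer is None:
        return
    writer.add_scalar("Accuracy/train", iou_avg_train, epoch + 1)
    writer.add_scalar("Loss/train", loss_avg_train, epoch + 1)


# Illustrative values only; in train_epoch these come from the training loop.
writer = RecordingWriter()
log_epoch_metrics(writer, comm_rank=0, epoch=4, iou_avg_train=0.8, loss_avg_train=0.1)
print(writer.records)
```

In a real run the stand-in would be replaced by the SummaryWriter instance created in main.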

DeepCam on CPU

PyTorch can also be run on the ARCHER2 CPU nodes. However, since DeepCam uses the torch.distributed module, we cannot use Horovod to handle (via MPI) inter-task communications. We must instead build PyTorch from source so that we can link torch.distributed to the correct Cray MPICH libraries.

The instructions for doing such a build can be found at https://github.com/hpc-uk/build-instructions/blob/main/pyenvs/pytorch/build_pytorch_2.9.1a0_from_source_archer2_cpu.md.

This install can be accessed by loading the pytorch/2.9.1a0 module. Please note, PyTorch source version 2.9.1a0 corresponds to PyTorch package version 2.9.1.

Once again, as we are running the DeepCam benchmark, we'll need to set up a local Python environment for installing the MLPerf logging package. This time the local environment is based on the pytorch/2.9.1a0 module.

#!/bin/bash

module -q load pytorch/2.9.1a0

PYTHON_TAG=python`echo ${CRAY_PYTHON_LEVEL} | cut -d. -f1-2`

PRFX=${HOME/home/work}/pyenvs
PYVENV_ROOT=${PRFX}/mlperf-pt
PYVENV_SITEPKGS=${PYVENV_ROOT}/lib/${PYTHON_TAG}/site-packages

mkdir -p ${PYVENV_ROOT}
cd ${PYVENV_ROOT}


python -m venv --system-site-packages ${PYVENV_ROOT}

extend-venv-activate ${PYVENV_ROOT}

source ${PYVENV_ROOT}/bin/activate


mkdir -p ${PYVENV_ROOT}/wheels
cd ${PYVENV_ROOT}/wheels

pip download --no-cache-dir wheel
python -m pip install --no-build-isolation *

pip download --no-cache-dir h5py mlperf-logging warmup-scheduler
python -m pip install --no-build-isolation *


deactivate
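In the script above, PYTHON_TAG is built by keeping the first two dot-separated fields of CRAY_PYTHON_LEVEL (the cut -d. -f1-2 step) and prefixing them with python; it is then used to form the site-packages path of the virtual environment. The Python sketch below reproduces that string handling for illustration. The function names and the example root path are hypothetical, and the version strings are simply the two Python versions mentioned earlier on this page.

```python
def python_tag(cray_python_level: str) -> str:
    """Keep the major.minor part of the version, e.g. '3.11.7' -> 'python3.11'."""
    major_minor = ".".join(cray_python_level.split(".")[:2])
    return f"python{major_minor}"


def site_packages(pyvenv_root: str, cray_python_level: str) -> str:
    """Build the site-packages path in the same way as PYVENV_SITEPKGS."""
    return f"{pyvenv_root}/lib/{python_tag(cray_python_level)}/site-packages"


# Example values only; the real ones come from the loaded module.
for level in ("3.10.10", "3.11.7"):
    print(site_packages("/work/t01/t01/auser/pyenvs/mlperf-pt", level))
```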

In order to run a DeepCam training job, you must first clone the MLCommons HPC GitHub repo.

mkdir -p ${HOME/home/work}/tests
cd ${HOME/home/work}/tests

git clone https://github.com/mlcommons/hpc.git mlperf-hpc

cd ./mlperf-hpc/deepcam/src/deepCam

Next, we need to edit some parts of the DeepCam Python source such that DeepCam is properly integrated with Cray MPICH.

The init function defined in ./utils/comm.py contains an if statement that initialises the DeepCam job according to the selected communications method. You will need to edit the mpi branch of this if statement as shown below.

...

def init(method, batchnorm_group_size=1):

    if method == "nccl-openmpi":

    ...

    elif method == "mpi":
        rank = int(os.getenv("SLURM_PROCID"))
        world_size = int(os.getenv("SLURM_NTASKS"))
        dist.init_process_group(backend = "mpi",
                                rank = rank,
                                world_size = world_size)

    else:
        raise NotImplementedError()

    ...    
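The mpi branch above takes the task's rank and the total task count from the SLURM_PROCID and SLURM_NTASKS environment variables, which are set for each task when the script is launched with srun. As a hedged illustration, the sketch below reads the same two variables with some basic checks, so that a run started outside of srun fails with a clear message instead of a TypeError. The helper name is hypothetical and is not part of DeepCam.

```python
import os


def slurm_rank_and_world_size(env=None):
    """Read the rank and world size from Slurm's per-task environment."""
    env = os.environ if env is None else env
    try:
        rank = int(env["SLURM_PROCID"])
        world_size = int(env["SLURM_NTASKS"])
    except KeyError as missing:
        raise RuntimeError(
            f"{missing.args[0]} is not set; launch the script with srun"
        ) from None
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside the range 0..{world_size - 1}")
    return rank, world_size


# Example with an explicit environment, as seen by task 3 of a 32-task job.
print(slurm_rank_and_world_size({"SLURM_PROCID": "3", "SLURM_NTASKS": "32"}))
```

The returned pair corresponds to the rank and world_size arguments passed to dist.init_process_group in the snippet above.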

Second, as we're not running on a GPU platform, we'll need to comment out a statement that calls a GPU-based synchronisation method; see the synchronize method within ./utils/bnstats.py.

...

def synchronize(self):

    if dist.is_initialized():
        # sync the device before
        #torch.cuda.synchronize()
        pass  # keeps the if block valid now that the call above is commented out

    with torch.no_grad():
        ...

DeepCam can now be run on the CPU nodes using a submission script like the one below.

#!/bin/bash

#SBATCH --job-name=deepcam
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --time=10:00:00
#SBATCH --exclusive


JOB_OUTPUT_PATH=./results/${SLURM_JOB_ID}
mkdir -p ${JOB_OUTPUT_PATH}/logs

source ${HOME/home/work}/pyenvs/mlperf-pt/bin/activate

export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun --hint=nomultithread \
     python train.py \
         --run_tag test \
         --data_dir_prefix /work/z19/shared/mlperf-hpc/deepcam/mini \
         --output_dir ${JOB_OUTPUT_PATH} \
         --wireup_method mpi \
         --max_inter_threads ${SLURM_CPUS_PER_TASK} \
         --max_epochs 64 \
         --local_batch_size 1

mv slurm-${SLURM_JOB_ID}.out ${JOB_OUTPUT_PATH}/slurm.out

The script above activates the local Python environment so that the mlperf-logging package is available; this is needed by the logger object declared in the DeepCam train.py script. Notice also that the --wireup_method parameter is now set to mpi and that a new parameter, --max_inter_threads, has been added to specify the maximum number of concurrent readers.

DeepCam runs much more slowly on the CPU nodes than on the GPU nodes. Running on 32 CPU nodes, as shown above, will take around 6 hours to complete 35 epochs. This assumes you're using the default hyperparameter settings for DeepCam.
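A quick back-of-the-envelope comparison, using only the figures quoted on this page, puts the difference in context. The sketch below takes the upper end of the GPU runtime (30 minutes on 2 nodes) and the CPU runtime (6 hours on 32 nodes). Note that a GPU node-hour and a CPU node-hour are not charged or provisioned in the same way, so this is a rough indication only.

```python
def node_hours(nodes, hours):
    """Resource use expressed as nodes multiplied by wall-clock hours."""
    return nodes * hours


# Figures quoted in the text: 2 GPU nodes for up to 30 minutes,
# and 32 CPU nodes for around 6 hours (35 epochs).
gpu_node_hours = node_hours(2, 30 / 60)
cpu_node_hours = node_hours(32, 6)
cpu_minutes_per_epoch = 6 * 60 / 35

print(f"GPU run: {gpu_node_hours} node-hours")
print(f"CPU run: {cpu_node_hours} node-hours")
print(f"CPU run: about {cpu_minutes_per_epoch:.1f} minutes per epoch")
```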

TensorFlow

On ARCHER2, TensorFlow is supported for use on the CPU nodes only.

We'll demonstrate the use of TensorFlow with the CosmoFlow benchmark. It involves training a neural network to recognise cosmological parameter values from the output generated by 3D dark matter simulations. See the link below for more details.

CosmoFlow: using deep learning to learn the universe at scale

There are two CosmoFlow training datasets available on ARCHER2: a 5.6 GB mini dataset (/work/z19/shared/mlperf-hpc/cosmoflow/mini) and a much larger 1.7 TB dataset (/work/z19/shared/mlperf-hpc/cosmoflow/full).

CosmoFlow on CPU

In order to run a CosmoFlow training job, you must first clone the MLCommons HPC GitHub repo.

mkdir -p ${HOME/home/work}/tests
cd ${HOME/home/work}/tests

git clone https://github.com/mlcommons/hpc.git mlperf-hpc

cd ./mlperf-hpc/cosmoflow

You are now ready to run the following CosmoFlow submission script via the sbatch command.

#!/bin/bash

#SBATCH --job-name=cosmoflow
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --time=03:00:00
#SBATCH --exclusive

module -q load tensorflow/2.13.0

export UCX_MEMTYPE_CACHE=n
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
export MPICH_DPM_DIR=${SLURM_SUBMIT_DIR}/dpmdir

export OMP_NUM_THREADS=16
export TF_ENABLE_ONEDNN_OPTS=1

srun --hint=nomultithread --distribution=block:block --cpu-freq=2250000 \
    python train.py \
        --distributed --omp-num-threads ${OMP_NUM_THREADS} \
        --inter-threads 0 --intra-threads 0 \
        --n-epochs 2048 --n-train 1024 --n-valid 1024 \
        --data-dir /work/z19/shared/mlperf-hpc/cosmoflow/mini/cosmoUniverse_2019_05_4parE_tf_v2_mini

The CosmoFlow job runs eight MPI tasks per node (one per NUMA region) with sixteen threads per task, so each node is fully populated. The TF_ENABLE_ONEDNN_OPTS variable refers to Intel's oneAPI Deep Neural Network library (oneDNN). Within the TensorFlow source there are #ifdef guards that are activated when oneDNN is enabled. It turns out that having TF_ENABLE_ONEDNN_OPTS=1 also improves performance (by a factor of 12) on AMD processors.
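The arithmetic behind "fully populated" can be checked with a short sketch. It uses the values from the submission script (32 nodes, 8 tasks per node, 16 CPUs per task) and assumes that, with block:block distribution, the tasks on a node receive consecutive blocks of 16 cores, which is what places one task in each NUMA region. The contiguous core numbering and the function name are assumptions made for this illustration.

```python
def node_layout(nodes=32, tasks_per_node=8, cpus_per_task=16):
    """Summarise the MPI task and thread layout of the job."""
    core_ranges = [
        (task * cpus_per_task, (task + 1) * cpus_per_task - 1)
        for task in range(tasks_per_node)
    ]
    return {
        "total_ranks": nodes * tasks_per_node,
        "cores_per_node": tasks_per_node * cpus_per_task,
        # Assumed mapping: local task i gets the i-th block of cores.
        "core_ranges": core_ranges,
    }


layout = node_layout()
print(layout["total_ranks"], "MPI tasks in total")
print(layout["cores_per_node"], "cores used per node")
for task, (first, last) in enumerate(layout["core_ranges"]):
    print(f"local task {task}: cores {first}-{last}")
```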

The inter/intra thread training parameters allow one to exploit any parallelism implied by the TensorFlow (TF) DNN graph. For example, if a node in the TF graph can be parallelised, the number of threads assigned will be the value of --intra-threads; and, if there are separate nodes in the TF graph that can be run concurrently, the available thread count for such an activity is the value of --inter-threads. Of course, the optimum values for these parameters will depend on the DNN graph. The job script above tells TensorFlow to choose the values by setting both parameters to zero.

You will note that only a few hyperparameters are specified for the CosmoFlow training job (e.g., --n-epochs, --n-train and --n-valid). Those settings in fact override the values assigned to those same parameters within the ./configs/cosmo.yaml file. However, that file contains settings for many other hyperparameters that are not overwritten.
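The override behaviour described above can be pictured as a simple merge in which any value given on the command line replaces the corresponding value from the configuration file, while everything else is left as it was. The sketch below illustrates the idea only; it is not the CosmoFlow implementation, and the key names and default values shown are hypothetical rather than taken from ./configs/cosmo.yaml.

```python
def apply_overrides(config, overrides):
    """Return a copy of config in which non-None overrides take precedence."""
    merged = dict(config)
    merged.update({key: value for key, value in overrides.items() if value is not None})
    return merged


# Hypothetical settings standing in for the contents of the YAML file.
file_config = {"n_epochs": 128, "n_train": 512, "n_valid": 512, "batch_size": 4}

# Values given on the command line; None means the option was not supplied.
cli_overrides = {"n_epochs": 2048, "n_train": 1024, "n_valid": 1024, "batch_size": None}

print(apply_overrides(file_config, cli_overrides))
```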

The CosmoFlow job specified above should take around 140 minutes to complete 2048 epochs, which should be sufficient to achieve a mean absolute error of 0.23.
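Assuming the error figure quoted above is a mean absolute error (MAE), it is the average of the absolute differences between the predicted and the true parameter values. The sketch below is a minimal pure-Python illustration with made-up numbers; it is not the CosmoFlow evaluation code.

```python
def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over all values."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)


# Made-up values, for illustration only.
predictions = [0.5, -0.25, 1.0, 0.0]
targets = [0.25, -0.25, 0.5, 0.25]
print(mean_absolute_error(predictions, targets))
```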