NAMD

NAMD is an award-winning parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

Useful Links

Using NAMD on ARCHER2

NAMD is freely available to all ARCHER2 users.

ARCHER2 has two versions of NAMD available: no-SMP (namd/2.14-nosmp) or SMP (namd/2.14). The SMP (Shared Memory Parallelism) build of NAMD introduces threaded parallelism to address memory limitations. The no-SMP build will typically provide the best performance but most users will require SMP in order to cope with high memory requirements.

Important

The namd modules reset the CPU frequency to the highest possible value (2.25 GHz) as this generally achieves the best balance of performance to energy use. You can change this setting by following the instructions in the Energy use section of the User Guide.

Running MPI only NAMD jobs

Using no-SMP NAMD will run jobs with only MPI processes and will not introduce additional threaded parallelism. This is the simplest approach to running NAMD jobs and is likely to give the best performance unless simulations are limited by high memory requirements.

The following script will run a pure MPI NAMD MD job using 4 nodes (i.e. 128x4 = 512 MPI parallel processes).

#!/bin/bash

# Request four nodes to run a job of 512 MPI tasks with 128 MPI
# tasks per node, here for maximum time 20 minutes.

#SBATCH --job-name=namd-nosmp
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --time=00:20:00

# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard

module load namd/2.14-nosmp

# Ensure the cpus-per-task option is propagated to srun commands
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun --distribution=block:block --hint=nomultithread namd2 input.namd

Running SMP NAMD jobs

If your jobs runs out of memory, then using the SMP version of NAMD will reduce the memory requirements. This involves launching a combination of MPI processes for communication and worker threads which perform computation.

The following script will run a SMP NAMD MD job using 4 nodes with 8 MPI communication processes per node and 16 worker threads per communication process (i.e. a fully-occupied node with all 512 cores populated with processes).

#!/bin/bash
#SBATCH --job-name=namd-smp
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4
#SBATCH --nodes=4
#SBATCH --time=00:20:00

# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard

# Load the relevant modules
module load namd

# Set procs per node (PPN) & OMP_NUM_THREADS
export PPN=$(($SLURM_CPUS_PER_TASK-1))
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores

# Record PPN in the output file
echo "Number of worker threads PPN = $PPN"

# Run NAMD
srun --distribution=block:block --hint=nomultithread namd2 +setcpuaffinity +ppn $PPN input.namd

Important

Please do not set SRUN_CPUS_PER_TASK when running the SMP version of NAMD. Otherwise, Charm++ will be unable to pin processes to CPUs, causing NAMD to abort with errors such as Couldn't bind to cpuset 0x00000010,,,0x0: Invalid argument.

How do I choose an optimal choice of MPI processes and worker threads for my simulations? The optimal choice for the numbers of MPI processes and worker threads per node depends on the data set and the number of compute nodes. Before running large production jobs, it is worth experimenting with these parameters to find the optimal configuration for your simulation.

We recommend that users match the ARCHER2 NUMA architecture to find the optimal balance of thread and process parallelism. The NUMA levels on ARCHER2 compute nodes are: 4 cores per CCX, 8 cores per CCD, 16 cores per memory controller, 64 cores per socket. For example, the above submission script specifies 32 MPI communication processes per node and 4 worker threads per communication process which places 1 MPI process per CCX on each node.

Note

To ensure fully occupied nodes with the SMP build of NAMD and match the NUMA layout, the optimal values of (tasks-per-node, cpus-per-task) are likely to be (32,4), (16,8) or (8,16).

How do I choose a value for the +ppn flag? The number of workers per communication process is specified by the +ppn argument to NAMD, which is set here to equal cpus-per-task - 1, to leave a CPU-core free for the associated MPI process.

We recommend that users reserve a thread per process to improve the scalability. Reserving this thread on a many-cores-per-node architecture like ARCHER2 will reduce the communication between threads and improve the scalability.

Compiling NAMD

The latest instructions for building NAMD on ARCHER2 may be found in the GitHub repository of build instructions.

ARCHER2 Full System