NAMD is an award-winning parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.
Using NAMD on ARCHER2
NAMD is freely available to all ARCHER2 users.
ARCHER2 has two versions of NAMD available: no-SMP (
or SMP (
namd/2.14). The SMP (Shared Memory Parallelism) build of NAMD
introduces threaded parallelism to address memory limitations. The no-SMP
build will typically provide the best performance but most users will require
SMP in order to cope with high memory requirements.
namd modules reset the CPU frequency to the highest possible value
(2.25 GHz) as this generally achieves the best balance of performance to
energy use. You can change this setting by following the instructions in the
Energy use section of the User Guide.
Running MPI only NAMD jobs
Using no-SMP NAMD will run jobs with only MPI processes and will not introduce additional threaded parallelism. This is the simplest approach to running NAMD jobs and is likely to give the best performance unless simulations are limited by high memory requirements.
The following script will run a pure MPI NAMD MD job using 4 nodes (i.e. 128x4 = 512 MPI parallel processes).
#!/bin/bash # Request four nodes to run a job of 512 MPI tasks with 128 MPI # tasks per node, here for maximum time 20 minutes. #SBATCH --job-name=namd-nosmp #SBATCH --nodes=4 #SBATCH --ntasks-per-node=128 #SBATCH --cpus-per-task=1 #SBATCH --time=00:20:00 # Replace [budget code] below with your project code (e.g. t01) #SBATCH --account=[budget code] #SBATCH --partition=standard #SBATCH --qos=standard module load namd/2.14-nosmp # Ensure the cpus-per-task option is propagated to srun commands export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK srun --distribution=block:block --hint=nomultithread namd2 input.namd
Running SMP NAMD jobs
If your jobs runs out of memory, then using the SMP version of NAMD will reduce the memory requirements. This involves launching a combination of MPI processes for communication and worker threads which perform computation.
The following script will run a SMP NAMD MD job using 4 nodes with 8 MPI communication processes per node and 16 worker threads per communication process (i.e. a fully-occupied node with all 512 cores populated with processes).
#!/bin/bash #SBATCH --job-name=namd-smp #SBATCH --ntasks-per-node=32 #SBATCH --cpus-per-task=4 #SBATCH --nodes=4 #SBATCH --time=00:20:00 # Replace [budget code] below with your project code (e.g. t01) #SBATCH --account=[budget code] #SBATCH --partition=standard #SBATCH --qos=standard # Load the relevant modules module load namd # Set procs per node (PPN) & OMP_NUM_THREADS export PPN=$(($SLURM_CPUS_PER_TASK-1)) export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK export OMP_PLACES=cores # Record PPN in the output file echo "Number of worker threads PPN = $PPN" # Run NAMD srun --distribution=block:block --hint=nomultithread namd2 +setcpuaffinity +ppn $PPN input.namd
Please do not set
SRUN_CPUS_PER_TASK when running the SMP version of NAMD.
Otherwise, Charm++ will be unable to pin processes to CPUs, causing NAMD to abort
with errors such as
Couldn't bind to cpuset 0x00000010,,,0x0: Invalid argument.
How do I choose an optimal choice of MPI processes and worker threads for my simulations? The optimal choice for the numbers of MPI processes and worker threads per node depends on the data set and the number of compute nodes. Before running large production jobs, it is worth experimenting with these parameters to find the optimal configuration for your simulation.
We recommend that users match the ARCHER2 NUMA architecture to find the optimal balance of thread and process parallelism. The NUMA levels on ARCHER2 compute nodes are: 4 cores per CCX, 8 cores per CCD, 16 cores per memory controller, 64 cores per socket. For example, the above submission script specifies 32 MPI communication processes per node and 4 worker threads per communication process which places 1 MPI process per CCX on each node.
To ensure fully occupied nodes with the SMP build of NAMD and match the NUMA
layout, the optimal values of (
cpus-per-task) are likely
to be (32,4), (16,8) or (8,16).
How do I choose a value for the +ppn flag? The number of workers per communication process is specified by the +ppn argument to NAMD, which is set here to equal cpus-per-task - 1, to leave a CPU-core free for the associated MPI process.
We recommend that users reserve a thread per process to improve the scalability. Reserving this thread on a many-cores-per-node architecture like ARCHER2 will reduce the communication between threads and improve the scalability.
The latest instructions for building NAMD on ARCHER2 may be found in the GitHub repository of build instructions.