Distributed memory

Distributed memory jobs run across multiple compute nodes. A key feature is that each process only has access to the memory of the node on which it runs. These jobs typically use a message-passing library, such as MPI, to coordinate and communicate between processes.

MPI jobs can be launched on DIaL3 using either mpirun/mpiexec or srun. mpirun/mpiexec are usually provided by the particular MPI implementation in use, while srun is provided by the SLURM scheduler.
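
Each of the example scripts below launches a compiled MPI binary (./mpi_hello_world or similar). For reference, a minimal version of such a program might look like the following C source (the file name mpi_hello_world.c is illustrative); after loading a compiler and MPI module it can be built with the MPI compiler wrapper, for example mpicc mpi_hello_world.c -o mpi_hello_world.

/* mpi_hello_world.c - each rank reports its rank, the total number of
 * ranks and the node it is running on */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                   /* start the MPI runtime  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* this process's rank    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of ranks  */
    MPI_Get_processor_name(name, &name_len);  /* node this rank is on   */

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();                           /* shut MPI down cleanly  */
    return 0;
}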

Open MPI - mpirun

#!/bin/bash
#SBATCH --job-name="mpirun_test"
#SBATCH --output=mpirun_test_o.%j
#SBATCH --error=mpirun_test_e.%j

#single node, 128 tasks
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128

# Run for ten minutes
#SBATCH --time=00:10:00
#SBATCH -A account_name

module load gcc/10.3.0
module load openmpi/4.0.5
module load openblas/0.3.15
mpirun ./mpi_hello_world

Open MPI - srun

#!/bin/bash
#SBATCH --job-name="mpirun_test"
#SBATCH --output=mpirun_test_o.%j
#SBATCH --error=mpirun_test_e.%j

#single node, 128 tasks
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128

# Run for ten minutes
#SBATCH --time=00:10:00
#SBATCH -A account_name

module load gcc/10.3.0
module load openmpi/4.0.5
module load openblas/0.3.15
srun ./mpi_hello_world

Intel MPI

Launching jobs with mpirun using Intel MPI (both Parallel Studio and oneAPI) does not require any additional setup beyond loading the relevant compiler and MPI modules. The sample below should work with any of the installed Intel compilers and MPI versions.

#!/bin/sh
#SBATCH --job-name="intelmpi_test"
#SBATCH --output=intelmpi_test_o.%j
#SBATCH --error=intelmpi_test_e.%j

#SBATCH -A account_name
#SBATCH -t 00:10:00
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128

module purge
module load intel-oneapi-compilers/2021.2.0
module load intel-oneapi-mpi/2021.4.0

mpirun ./mpi_hello_intel_2021

If you use srun to launch your Intel MPI codes, some additional environment variables need to be set, as shown in the examples below.

Intel Parallel Studio - srun

#!/bin/bash
#SBATCH --job-name="intel_mpirun_test"
#SBATCH --output=intel_mpirun_test_o.%j
#SBATCH --error=intel_mpirun_test_e.%j
#SBATCH -A account_name
#SBATCH -t 00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128

# Clear and load modules
module purge
module load intel-parallel-studio/cluster.2019.5

# before calling srun, I_MPI_PMI_LIBRARY needs to point to the PMI2 library
export I_MPI_PMI_LIBRARY=/usr/lib64/pmix/lib/libpmi2.so
srun --mpi=pmi2 ./mpi_hello_intel_2019

Intel oneAPI - srun

#!/bin/bash
#SBATCH --job-name="intel_mpirun_test"
#SBATCH --output=intel_mpirun_test_o.%j
#SBATCH --error=intel_mpirun_test_e.%j
#SBATCH -A account_name
#SBATCH -t 00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128

# purge and load modules
module purge
module load intel-oneapi-compilers/2021.2.0
module load intel-oneapi-mpi/2021.4.0

# set up the environment so Intel MPI uses the system libfabric and SLURM's PMI2 library
export I_MPI_OFI_LIBRARY_INTERNAL=0
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
export SLURM_MPI_TYPE=pmi2

# call srun to launch processes
srun --mpi=pmi2 ./mpi_hello_intel_2021

Cray MPI

The Cray MPI implementation does not include an mpirun launcher, so jobs should be launched using srun:

#!/bin/bash
#SBATCH --job-name="test_cray_mpi"
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --output=test_cray_mpi_o.%j
#SBATCH --error=test_cray_mpi_e.%j
#SBATCH -A project_account

# load cray modules
module purge
module load PrgEnv-cray/8.0.0
module load cray-pmi
module load cray-fftw/3.3.8.8

# launch the program with srun rather than mpirun
srun ./mpi_hello_world

Hybrid MPI/OpenMP Jobs

#!/bin/bash -l

#######################################
# example for a hybrid MPI OpenMP job #
#######################################

#SBATCH --job-name=your_job_name
#SBATCH -o output_file_name%j
#SBATCH -e error_file_name%j
#SBATCH -p partition_name
#SBATCH -A account_name

# we ask for 8 MPI tasks (2 nodes x 4 tasks per node) with 32 cores each
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32

# run for fifteen minutes hh:mm:ss
#SBATCH --time=00:15:00
#SBATCH --mail-type=ALL

module purge
module load gcc/10.3.0
module load openmpi/4.0.5

# we set OMP_NUM_THREADS to the number of CPU cores per MPI task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# we execute the job and time it
time mpirun -np $SLURM_NTASKS ./my_binary.x

# or if using srun:
#srun ./my_binary.x

# if you ran the job from a scratch directory, copy your output back to the
# directory the job was submitted from (this assumes SCRATCH_DIRECTORY was
# defined and used earlier in the script)
cp ${SCRATCH_DIRECTORY}/my_output ${SLURM_SUBMIT_DIR}

# we step out of the scratch directory and remove it
cd ${SLURM_SUBMIT_DIR}
rm -rf ${SCRATCH_DIRECTORY}

# Finish the script and Exit.
exit 0
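
For reference, the my_binary.x launched above could be a hybrid code along the following lines (an illustrative sketch only, assuming C with OpenMP; build it with the MPI compiler wrapper and the compiler's OpenMP flag, for example mpicc -fopenmp hybrid.c -o my_binary.x):

/* hybrid.c - each MPI rank spawns a team of OpenMP threads; the team size
 * is taken from OMP_NUM_THREADS, which the script above sets to
 * SLURM_CPUS_PER_TASK */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* request threaded MPI; MPI_THREAD_FUNNELED is sufficient when only
       the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        printf("Rank %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

With the resources requested above, this runs as 8 MPI ranks (4 per node on 2 nodes), each spawning 32 OpenMP threads, so all 128 cores on each node are used.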