Native SLURM

Cray systems are now equipped with native SLURM: the main difference is the absence of aprun, which has been replaced by the SLURM command srun.
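
As a quick illustration (with made-up task counts, not taken from a specific application), an ALPS launch such as aprun -n 16 -N 8 ./myprogram.exe would simply become:

srun -n 16 --ntasks-per-node=8 ./myprogram.exe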

The migration from Cray ALPS (Application Level Placement Scheduler) to native SLURM is supported by the simple examples below, which cover the most common usage with MPI and hybrid MPI/OpenMP jobs.
The SLURM man pages (e.g. man sbatch) give useful information and more details on specific options, along with the documentation available online at http://slurm.schedmd.com/documentation.html.

Advanced users might also be interested in consulting the presentations available online from the Slurm User Group meeting, which cover the new features of the latest SLURM release: http://slurm.schedmd.com/publications.html.



#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --ntasks-per-core=2
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
srun -n $SLURM_NTASKS --ntasks-per-node=$SLURM_NTASKS_PER_NODE -c $SLURM_CPUS_PER_TASK --cpu_bind=rank ./myprogram.exe

The example above shows an MPI job allocated on two nodes using hyperthreading on Piz Daint.
You might need to add the SLURM option --constraint=gpu to run on the XC50, using the GPU accelerator available on each node. With the exception of the gpu flag, the same srun options apply on the XC40, where you need to adjust the number of cores to the Broadwell compute nodes, which feature two sockets with eighteen cores each. The flag -l at the beginning of the script allows you to call the module command within the script, in case you need it.
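
For instance, the script above could be adapted for the XC50 GPU nodes as sketched below; the daint-gpu module name is given only as an illustration of a module command made possible by the -l flag, so adjust it to your environment:

#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --ntasks-per-core=2
#SBATCH --cpus-per-task=2
#SBATCH --constraint=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
module load daint-gpu   # illustrative module command; adjust to your environment
srun -n $SLURM_NTASKS --ntasks-per-node=$SLURM_NTASKS_PER_NODE -c $SLURM_CPUS_PER_TASK --cpu_bind=rank ./myprogram.exe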


#!/bin/bash -l
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=18
#SBATCH --constraint=mc
srun -n $SLURM_NTASKS --ntasks-per-node=$SLURM_NTASKS_PER_NODE -c $SLURM_CPUS_PER_TASK --hint=nomultithread ./test.mpi

The SLURM script above shows how to run a hybrid job with two MPI tasks per node, spawning eighteen threads per socket on a two-socket Broadwell compute node.
The srun option --hint=nomultithread avoids using extra threads with in-core multi-threading, a configuration that can benefit communication-intensive applications (see man srun for further details).
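
For the OpenMP part, it is common to tie the number of threads to the CPUs allocated per task; the line below is not part of the example above, but would typically be added before the srun command:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK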


export PMI_MMAP_SYNC_WAIT_TIME=300
srun --wait 200 --bcast=/tmp/hello-world.${ARCH} $HOME/jobs/bin/daint/hello-world.${ARCH}

The example above shows what needs to be done to run large MPI jobs:
- the environment variable PMI_MMAP_SYNC_WAIT_TIME and the srun option --wait prevent SLURM from killing tasks that take a long time to run
- the srun option --bcast copies the binary to /tmp on all allocated nodes before launching the tasks, which speeds up task startup
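
Putting these pieces together, a complete batch script for a large run might look like the sketch below; the node count and time limit are arbitrary placeholders, and ${ARCH} is left unset as in the snippet above:

#!/bin/bash -l
#SBATCH --nodes=1000
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
export PMI_MMAP_SYNC_WAIT_TIME=300
srun --wait 200 --bcast=/tmp/hello-world.${ARCH} $HOME/jobs/bin/daint/hello-world.${ARCH}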

Synoptic table

Option                  aprun       srun
MPI tasks               -n          -n, --ntasks
MPI tasks per node      -N          --ntasks-per-node
CPUs per task           -d          -c, --cpus-per-task
Thread/task affinity    -cc cpu     --cpu_bind=rank
Large memory nodes      -q bigmem   --mem=127GB
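
As an illustration of the mapping above (with made-up values), an ALPS command line such as aprun -n 32 -N 8 -d 4 -cc cpu ./myprogram.exe translates to:

srun --ntasks=32 --ntasks-per-node=8 --cpus-per-task=4 --cpu_bind=rank ./myprogram.exe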

The other SLURM commands remain the same: the list of queues and partitions is available by typing sinfo or scontrol show partition, the SLURM queue can be monitored with the squeue command, and the jobs saved in the SLURM database can be inspected with the sacct command.
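
For example (the job id shown is a placeholder):

sinfo                      # list partitions and the state of their nodes
scontrol show partition    # detailed settings of each partition
squeue -u $USER            # jobs of the current user in the queue
sacct -j <jobid>           # accounting data of a job stored in the SLURM database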

If you have any questions, don't hesitate to contact us at help@cscs.ch.