Eiger - Dalco SM System
EIGER SLURM Batch Queuing System
Since Wednesday 07 September 2011, the Visualization, Research & Development Cluster EIGER has been running the SLURM batch queuing system. SLURM, "The Simple Linux Utility for Resource Management", is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
EIGER cluster resources can still be accessed in batch or interactive mode, using the SLURM client job submission commands :
- sbatch : submits a batch script to SLURM
- salloc : obtains a SLURM job allocation (a set of nodes)
- srun : run a parallel job
The following SLURM end-user commands let you check the status of the configured partitions (queues) and the availability of resources (nodes), control and monitor jobs, and obtain resource usage and detailed job information :
- squeue : view information about jobs located in the SLURM scheduling queue
- sinfo : view information about SLURM nodes and partitions
- smap : graphically view information about SLURM jobs, partitions
- sview : graphical user interface to view and modify SLURM state
- scancel : used to signal or cancel jobs or job steps
- scontrol : view the current SLURM configuration
- sacct : displays user job accounting data
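As a small illustration of how these commands can be combined, the sketch below counts jobs per state. It assumes the state names come from squeue -h -u $USER -o "%T" (one state per line); the function name count_job_states is made up for this example and is not part of SLURM.

```shell
#!/bin/sh
# Hypothetical helper: count jobs per state.
# Reads one job state per line on stdin and prints STATE=COUNT lines.
# Assumed input source (not run here):
#   squeue -h -u "$USER" -o "%T" | count_job_states
count_job_states() {
    # sort groups identical states, uniq -c counts each group,
    # awk swaps the "count state" columns into "state=count".
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}
```

Feeding it the states RUNNING, PENDING, RUNNING gives one line per state with its count.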
Further SLURM documentation
Detailed SLURM documentation can be found on the official SLURM web site :
EIGER SLURM current configuration
The current batch queuing system configuration offers 2 distinct partitions (queues) :
- stdMem parallel 13 nodes
- largeMem parallel 8 nodes
The current EIGER resources available via SLURM are summarized here :
Node | Cores | Memory | Memory/Core | GPU | GPUs | Compute capability | GPU type | GPU cores | GPU memory | GPU clock | GPU memory type |
|---|---|---|---|---|---|---|---|---|---|---|---|
eiger200 | 12 | 24 GB | 2 GB | GTX 285 | 1 | 1.3 | geforce | 240 | 2 GB | 1.48 GHz | GDDR3 |
eiger201 | 12 | 24 GB | 2 GB | GTX 285 | 1 | 1.3 | geforce | 240 | 2 GB | 1.48 GHz | GDDR3 |
eiger202 | 12 | 24 GB | 2 GB | GTX 285 | 1 | 1.3 | geforce | 240 | 2 GB | 1.48 GHz | GDDR3 |
eiger203 | 12 | 24 GB | 2 GB | GTX 285 | 1 | 1.3 | geforce | 240 | 2 GB | 1.48 GHz | GDDR3 |
eiger204 | 12 | 24 GB | 2 GB | GTX 285 | 1 | 1.3 | geforce | 240 | 2 GB | 1.48 GHz | GDDR3 |
eiger205 | 12 | 24 GB | 2 GB | GTX 480 | 1 | 2.0 | fermi | 480 | 1536 MB | 1.4 GHz | GDDR5 |
eiger206 | 12 | 24 GB | 2 GB | GTX 480 | 1 | 2.0 | fermi | 480 | 1536 MB | 1.4 GHz | GDDR5 |
eiger207 | 24 | 48 GB | 2 GB | M2050 | 2 | 2.0 | fermi | 448 | 2.6 GB | 1.15 GHz | GDDR5 |
eiger208 | 24 | 48 GB | 2 GB | M2050 | 2 | 2.0 | fermi | 448 | 2.6 GB | 1.15 GHz | GDDR5 |
eiger209 | 24 | 48 GB | 2 GB | C2070 | 2 | 2.0 | fermi | 448 | 5.4 GB | 1.15 GHz | GDDR5 |
eiger210 | 24 | 48 GB | 2 GB | C2070 | 2 | 2.0 | fermi | 448 | 5.4 GB | 1.15 GHz | GDDR5 |
eiger220 | 12 | 48 GB | 4 GB | GTX 285 | 1 | 1.3 | geforce | 240 | 2 GB | 1.48 GHz | GDDR3 |
eiger221 | 12 | 48 GB | 4 GB | GTX 285 | 1 | 1.3 | geforce | 240 | 2 GB | 1.48 GHz | GDDR3 |
eiger222 | 12 | 48 GB | 4 GB | GTX 285 | 1 | 1.3 | geforce | 240 | 2 GB | 1.48 GHz | GDDR3 |
eiger223 | 12 | 48 GB | 4 GB | GTX 285 | 1 | 1.3 | geforce | 240 | 2 GB | 1.48 GHz | GDDR3 |
eiger240 | 12 | 24 GB | 2 GB | S1070 | 2 | 1.3 | tesla | 240 | 4 GB | 1.3 GHz | GDDR3 |
eiger241 | 12 | 24 GB | 2 GB | S1070 | 2 | 1.3 | tesla | 240 | 4 GB | 1.3 GHz | GDDR3 |
eiger242 | 12 | 24 GB | 2 GB | C2070 | 2 | 2.0 | fermi | 448 | 5.4 GB | 1.15 GHz | GDDR5 |
eiger243 | 12 | 24 GB | 2 GB | C2070 | 2 | 2.0 | fermi | 448 | 5.4 GB | 1.15 GHz | GDDR5 |
eiger180 | 12 | 24 GB | 2 GB | GTX 480 | 1 | 2.0 | fermi | 480 | 1536 MB | 1.4 GHz | GDDR5 |
eiger181 | 12 | 24 GB | 2 GB | GTX 480 | 1 | 2.0 | fermi | 480 | 1536 MB | 1.4 GHz | GDDR5 |
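
The table can also be filtered mechanically. The following is only a sketch: it assumes the table rows above have been saved verbatim in a text file (the name eiger-nodes.txt is hypothetical) and prints the nodes that offer a given GPU type (eighth column) with at least a given number of GPUs (sixth column). The function name select_nodes is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: list nodes by GPU type and minimum GPU count.
# Reads pipe-separated table rows on stdin.
# Usage (file name is an assumption):
#   select_nodes fermi 2 < eiger-nodes.txt
select_nodes() {
    awk -F'|' -v want_type="$1" -v min_gpus="$2" '
    {
        # Trim the blanks around every cell.
        for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)
        # Column 8 holds the GPU type, column 6 the number of GPUs.
        if ($8 == want_type && $6 + 0 >= min_gpus) print $1
    }'
}
```

The header and separator rows are skipped automatically because their eighth column never matches a GPU type.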
GENERIC RESOURCE SCHEDULING (Gres)
SLURM on EIGER is configured to support Gres scheduling of Graphics Processing Units (GPUs). See here for further technical details :
Two SLURM parameters can be used when submitting jobs (via srun, sbatch or salloc) to select an EIGER node with the desired GPU resources :
- --gres=gpu:N, where N=[1|2]
- --constraint=[geforce|gtx285|fermi|gtx480|m2050|c2070|s1070|tesla|m2090|nvidia]
The --gres parameter controls how many GPUs per node a job needs, while the --constraint parameter selects the GPU type. Provide exactly one --constraint value to target the desired GPU type.
- --constraint=nvidia
We recommend specifying the exact GPU type you need as the value of the --constraint parameter (s1070, c2070, m2050, m2090, gtx480, gtx285, hd6970), since your GPU-side application code (kernel) may be better optimized for a particular GPU type and can take advantage of its specific architecture and features (NVIDIA, Fermi vs non-Fermi GPUs, etc.).
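The two parameters can be assembled and checked before submission. The helper below is a hypothetical convenience function (gpu_opts is not a SLURM command); it accepts only the GPU counts and the --constraint values listed in the two bullets above and prints the matching options.

```shell
#!/bin/sh
# Hypothetical helper: build the --gres/--constraint options for EIGER.
# $1 = number of GPUs per node (1 or 2), $2 = GPU type constraint.
# Possible usage (not run here):
#   srun -N 1 -n 1 $(gpu_opts 1 gtx285) --pty /bin/bash
gpu_opts() {
    gpu_count=$1
    gpu_type=$2
    case $gpu_count in
        1|2) ;;
        *) echo "GPU count must be 1 or 2" >&2; return 1 ;;
    esac
    case $gpu_type in
        geforce|gtx285|fermi|gtx480|m2050|c2070|s1070|tesla|m2090|nvidia) ;;
        *) echo "unknown GPU type: $gpu_type" >&2; return 1 ;;
    esac
    echo "--gres=gpu:$gpu_count --constraint=$gpu_type"
}
```

Rejecting unknown values early avoids submitting a job that can never be scheduled.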
SLURM on EIGER
See the table above for the resources available on each node.
To become familiar with SLURM on EIGER, we suggest following the steps below :
1. Connect to the frontend server eiger.cscs.ch via SSH, going first through the CSCS main entry point ELA (ela.cscs.ch) :
ssh -Y ela.cscs.ch
ssh -Y eiger.cscs.ch
2. Start becoming familiar with the end-user SLURM commands :
squeue -a -l -u $USER
sinfo -a -l
scontrol -a show partition
scontrol show nodes
scontrol -a show job
sview &
3. Below are 3 examples of typical SLURM job submission templates. Try them with your own application code and don't hesitate to report problems to help@cscs.ch, specifying the keywords "EIGER: SLURM migration problems" in the subject line.
Simple SLURM interactive job request :
For your interactive session you need an EIGER compute node with at least 23 GB of main memory, 1 cpu and an NVIDIA GeForce GTX 285 graphics card, for 4 hours :
srun -N 1 -n 1 --gres=gpu:1 --constraint=gtx285 --mem=23g --time=04:00:00 --pty /bin/bash
Alternatively, if you need interactive access to multiple compute nodes (-N 2), you can reserve them with an allocation and then connect to the allocated nodes via SSH. Note that --gres is counted per node, and the GTX 285 nodes have one GPU each :
salloc -N 2 -n 1 --gres=gpu:1 --constraint=gtx285 --mem=23g --time=04:00:00
scontrol show job JOBID
ssh -Y eigerXXX.login.cscs.ch
ssh -Y eigerYYY.login.cscs.ch
Once your work is finished, release the node allocation by exiting the shell in which salloc was launched.
Simple SLURM batch script for serial jobs :
You need to submit a SLURM serial job requesting 1 cpu, 12 GB of main memory and 1 GPU of type FERMI for 6 hours and 30 minutes :
simple-serial.sh :
#!/bin/bash
#SBATCH --job-name="SLURM TEST JOB"
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --constraint=fermi
#SBATCH --mem=12g
#SBATCH --time=06:30:00
#SBATCH --partition=stdMem
#SBATCH --account=$GROUP
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@MAIL.DOMAIN
#SBATCH --output=/users/$USER/slurm-OUT.log
#SBATCH --error=/users/$USER/slurm-ERR.log
#======START=============================================
. /etc/profile.d/modules.bash
# Load your required module files here
module load MODULE_TO_BE_LOADED
echo "On which node your job has been scheduled :"
echo $SLURM_JOB_NODELIST
echo "Print current shell limits :"
ulimit -a
echo "Now run your serial tasks..."
# PLACE YOUR SERIAL CODE HERE
#======END===============================================
Now submit your SLURM serial job :
sbatch simple-serial.sh
Remarks: replace $GROUP with your primary group membership, $USER with your username and provide your e-mail address for the --mail-user directive
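The replacement has to be done in the file itself because sbatch reads the #SBATCH lines as plain text and does not expand shell variables in them. As one possible way to automate it, the sketch below substitutes the placeholders with sed. The function name fill_template and the sample values (s999, jdoe, example.org) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: fill the placeholders of a job script template.
# $1 = group/account, $2 = username, $3 = mail domain.
# Reads the template on stdin and writes the filled script to stdout.
# Possible usage (not run here):
#   fill_template s999 jdoe example.org < simple-serial.sh > my-serial.sh
fill_template() {
    # \$ matches a literal dollar sign, so the text "$GROUP" is replaced,
    # not the value of any shell variable.
    sed -e 's|\$GROUP|'"$1"'|g' \
        -e 's|\$USER|'"$2"'|g' \
        -e 's|MAIL\.DOMAIN|'"$3"'|g'
}
```

The generated file can then be submitted with sbatch as shown above.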
Simple SLURM batch script for parallel-MPI jobs :
You need to submit a parallel MPI SLURM job requesting 4 FERMI nodes and 48 MPI tasks in total, with 12 MPI tasks per node, each bound to exactly one cpu core, and needing a total of 32 GB of memory, i.e. 8 GB per node (about 0.67 GB per MPI task) :
simple-parallel.sh :
#!/bin/bash
#SBATCH --job-name="SLURM TEST JOB"
#SBATCH --nodes=4
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=12
#SBATCH --gres=gpu:1
#SBATCH --constraint=fermi
#SBATCH --mem=8g
#SBATCH --time=01:30:00
#SBATCH --partition=stdMem
#SBATCH --account=$GROUP
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@MAIL.DOMAIN
#SBATCH --output=/users/$USER/slurm-OUT.log
#SBATCH --error=/users/$USER/slurm-ERR.log
#======START===============================
. /etc/profile.d/modules.bash
module load mvapich2/1.7
echo "On which nodes it executes :"
echo $SLURM_JOB_NODELIST
echo "Which MPI Implementation is used :"
which mpiexec.hydra
mpiexec.hydra -info
echo "Print current shell limits :"
ulimit -a
echo "Now run the MPI tasks..."
mpiexec.hydra -rmk slurm -ppn 12 /users/$USER/MY_MPI_EXEC
#======END=================================
Now submit your SLURM Parallel-MPI job :
sbatch simple-parallel.sh
Remarks: replace $GROUP with your primary group membership, $USER with your username and provide your e-mail address for the --mail-user directive
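The resource figures of the parallel script follow from simple arithmetic on --nodes, --ntasks-per-node and --mem (which SLURM counts per node). The sketch below, with the made-up function name mpi_job_totals, reproduces that calculation so the directives can be cross-checked before submission.

```shell
#!/bin/sh
# Hypothetical helper: derive job totals from per-node requests.
# $1 = nodes, $2 = MPI tasks per node, $3 = memory per node in GB.
mpi_job_totals() {
    nodes=$1
    tasks_per_node=$2
    mem_per_node_gb=$3
    total_tasks=$((nodes * tasks_per_node))
    total_mem_gb=$((nodes * mem_per_node_gb))
    # Integer division: the result is rounded down to whole MB.
    mem_per_task_mb=$((mem_per_node_gb * 1024 / tasks_per_node))
    echo "tasks=$total_tasks mem_total_gb=$total_mem_gb mem_per_task_mb=$mem_per_task_mb"
}
```

For the script above, mpi_job_totals 4 12 8 reports 48 tasks, 32 GB in total and 682 MB per MPI task.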
Running MPI Jobs on Eiger
Let's suppose that you wish to run a parallel program (test.exe) with 24 MPI tasks for 30 minutes. With SLURM you can simply specify the number of tasks:
#!/bin/bash
#SBATCH --ntasks=24
#SBATCH --time=00:30:00
mpirun ./test.exe
On Eiger you use the "sbatch" submission command.
The “mpirun” command will be selected from the MPI module that is loaded. For example, if MVAPICH2 is loaded, then its process manager will be called. Likewise, for other MPI installations, users need to ensure that MPI library usage is consistent.
Alternatively, you can request the total number of nodes you require and then distribute your tasks and/or threads over them using suitable mpirun parameters. So, for 24 MPI processes, with one MPI process per core on Eiger, you need to request 2 nodes (2 nodes x 12 cores per node; note that some Eiger nodes have 24 cores). Your SLURM job submission file could look like:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12
#SBATCH --time=00:30:00
mpirun -np 24 ./test.exe
You can change the number of MPI tasks placed on each compute node with the "--ntasks-per-node" directive; the mpirun "-np" option then gives the matching total. The following job submission file runs 12 MPI processes in total, with each compute node now running only 6 MPI processes.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=6
#SBATCH --time=00:30:00
mpirun -np 12 ./test.exe
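The node counts in these examples come from dividing the number of MPI tasks by the number of tasks placed on each node, rounded up. A minimal sketch of that calculation follows; the function name nodes_needed is invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: how many nodes are needed for a given task layout.
# $1 = total MPI tasks, $2 = MPI tasks per node.
nodes_needed() {
    # Ceiling division using integer arithmetic.
    echo $(( ($1 + $2 - 1) / $2 ))
}
```

For example, nodes_needed 24 12 and nodes_needed 12 6 both print 2, matching the two submission files above.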
Running MPI + OpenMP Jobs on Eiger
It is possible to run hybrid MPI+OpenMP parallel jobs on Eiger. Let's assume that your parallel program requires 4 MPI tasks and that each MPI task can spawn 6 OpenMP threads and you would like to run the job on two nodes. An appropriate job submission file could be the following:
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6
#SBATCH --time=00:30:00
export OMP_NUM_THREADS=6
export MV2_ENABLE_AFFINITY=0
mpirun -np 4 ./test.exe
Two MPI processes run on each compute node, and each process spawns 6 OpenMP threads. With the MVAPICH2 implementation, MV2_ENABLE_AFFINITY=0 disables MVAPICH2's core affinity so that the OS scheduler can place the OpenMP threads of each MPI process on the available cores.
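A hybrid layout fits a node only if tasks per node times threads per task does not exceed the cores of that node (2 x 6 = 12 here). The sketch below checks this before submission; hybrid_layout is a made-up name, and the cores-per-node value must be taken from the resource table for the nodes you target.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check an MPI+OpenMP layout.
# $1 = total MPI tasks, $2 = MPI tasks per node,
# $3 = OpenMP threads (cpus) per task, $4 = cores per node.
hybrid_layout() {
    ntasks=$1
    tasks_per_node=$2
    cpus_per_task=$3
    cores_per_node=$4
    cores_used=$((tasks_per_node * cpus_per_task))
    if [ "$cores_used" -gt "$cores_per_node" ]; then
        echo "oversubscribed: $cores_used cores requested per node, $cores_per_node available" >&2
        return 1
    fi
    # Ceiling division gives the number of nodes the job occupies.
    nodes=$(( (ntasks + tasks_per_node - 1) / tasks_per_node ))
    echo "nodes=$nodes cores_used_per_node=$cores_used"
}
```

For the job above, hybrid_layout 4 2 6 12 reports 2 nodes with all 12 cores of each node in use.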
Running an Interactive Multi-node Job
There are two ways to accomplish this task.
1) Using MVAPICH2, you need to link your code with the PMI library provided by SLURM
> mpicc -o test.exe test.c -L/apps/eiger/slurm/2.3.0/lib -lpmi
Then you can ask for a certain number of resources
> salloc --ntasks=24
and you can use the “srun” command to run a job on the allocated resources
> srun -n16 --mpi=none ./test.exe
2) You can ask for resources using “salloc”, but then you need to find out which nodes have been assigned to you, for example with the scontrol show job JOBID command used in the interactive example above, which will show the nodes assigned to your job. You can ssh to the nodes and run your jobs by specifying the list of nodes that are assigned to your interactive session.
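One possible way to pick out the assigned nodes is to filter the output of scontrol show job JOBID, the command used in the interactive example earlier. The sketch below assumes that this output contains a line of the form NodeList=..., which may differ between SLURM versions; the function name job_nodelist is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract the node list of a job.
# Reads "scontrol show job JOBID" output on stdin.
# Possible usage (not run here):
#   scontrol show job JOBID | job_nodelist
job_nodelist() {
    # Keep only the value of the line that starts with NodeList=
    # (ReqNodeList= and ExcNodeList= do not match this pattern).
    sed -n 's/^[[:space:]]*NodeList=\(.*\)$/\1/p'
}
```

The result is printed in SLURM's compressed notation, for example eiger[200-201] for the nodes eiger200 and eiger201.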
PBS Pro versus SLURM Migration help table :
Below is a one-to-one mapping of PBS Pro to SLURM directives, useful when migrating existing PBS Pro batch scripts to SLURM batch scripts.
PBS Professional V 10.2 | SLURM V 2.3.0 pre-release 5 |
|---|---|
#PBS -N job_name | #SBATCH --job-name="My Job Name" |
#PBS -l select=1 | #SBATCH --nodes=1 |
#PBS -l mpiprocs=4 | #SBATCH --ntasks=4 |
#PBS -l ncpus=4 | #SBATCH --cpus-per-task=1 |
#PBS | #SBATCH --ntasks-per-node=4 |
#PBS -l host=eigerXXX | #SBATCH --nodelist=eigerXXX |
#PBS -M $USER@MAIL.DOMAIN | #SBATCH --mail-user=$USER@MAIL.DOMAIN |
#PBS -m abe | #SBATCH --mail-type=ALL |
#PBS -l walltime=HH:MM:SS | #SBATCH --time=HH:MM:SS |
#PBS -l min_walltime=HH:MM:SS | #SBATCH --time-min=HH:MM:SS |
#PBS -l cput=HH:MM:SS | #SBATCH |
#PBS -q <queue> | #SBATCH --partition=<queue> |
#PBS -q <reservation-queue> | #SBATCH --reservation=<reservation_name> |
#PBS -W group_list=a_group | #SBATCH --account=a_group |
#PBS -o /path/to/out | #SBATCH --output=/path/to/out (supports %j for jobid) |
#PBS -e /path/to/err | #SBATCH --error=/path/to/err (supports %j for jobid) |
#PBS -V | NOT REQUIRED ANYMORE |
#PBS -l gpu=fermi:ngpus=2 | #SBATCH --gres=gpu:2 and #SBATCH --constraint=fermi |
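
Part of this mapping can be applied automatically. The following is a rough sketch, not a complete converter: it rewrites only the directives that map one-to-one in the table (job name, select, mpiprocs, walltime, queue, group list, mail user, output and error paths) and drops #PBS -V. The function name pbs_to_slurm is invented for this example, and the result should always be reviewed by hand.

```shell
#!/bin/sh
# Hypothetical helper: translate simple #PBS directives to #SBATCH.
# Reads a PBS Pro script on stdin and writes the converted script to stdout.
# Possible usage (file names are assumptions):
#   pbs_to_slurm < job.pbs > job.slurm
pbs_to_slurm() {
    sed -e 's/^#PBS -N \(.*\)$/#SBATCH --job-name="\1"/' \
        -e 's/^#PBS -l select=/#SBATCH --nodes=/' \
        -e 's/^#PBS -l mpiprocs=/#SBATCH --ntasks=/' \
        -e 's/^#PBS -l walltime=/#SBATCH --time=/' \
        -e 's/^#PBS -q /#SBATCH --partition=/' \
        -e 's/^#PBS -W group_list=/#SBATCH --account=/' \
        -e 's/^#PBS -M /#SBATCH --mail-user=/' \
        -e 's/^#PBS -o /#SBATCH --output=/' \
        -e 's/^#PBS -e /#SBATCH --error=/' \
        -e '/^#PBS -V$/d'
}
```

Directives without a direct equivalent, such as the GPU request in the last table row, are left untouched and still need manual conversion.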

