Castor - IBM iDataPlex
CASTOR SLURM Batch Queuing System
Since Tuesday 20 December 2011, the CSCS IBM iDataPlex Research & Development GPU cluster Castor has been running the SLURM batch queuing system. SLURM, "The Simple Linux Utility for Resource Management", is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
CASTOR cluster resources can be accessed in batch or in interactive mode, using the SLURM client job submission commands :
- sbatch : submits a batch script to SLURM
- salloc : obtains a SLURM job allocation (a set of nodes)
- srun : runs a parallel job
The following SLURM end-user commands are available to check the status of the configured partitions (queues) and the availability of resources (nodes), to control and monitor jobs, and to inspect job resource usage and detailed job information :
- squeue : view information about jobs located in the SLURM scheduling queue
- sinfo : view information about SLURM nodes and partitions
- smap : graphically view information about SLURM jobs, partitions
- sview : graphical user interface to view and modify SLURM state
- scancel : used to signal or cancel jobs or job steps
- scontrol : view and modify the current SLURM configuration and state
- sacct : displays user job accounting data
Further SLURM documentation
SLURM detailed documentation can be found on the official SLURM web site :
CASTOR SLURM current configuration
The current batch queuing system configuration offers 3 partitions (queues) :
- production : parallel, 24 nodes, max. walltime 24 hours, open to all users
- express : parallel, 6 nodes, max. walltime 5 hours, open to all users
- special : parallel, 2 visualization nodes, max. walltime 24 hours, open based on user requests
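As a rough illustration of how these limits translate into a choice of partition, the following shell sketch picks a partition from the requested node count and walltime. The function name pick_partition is hypothetical, and it assumes that a job may use at most as many nodes as its partition contains; the special partition is left out because access to it is granted on request.

```shell
#!/bin/bash
# Hypothetical helper: choose a CASTOR partition from the limits listed above.
# Usage: pick_partition NODES HOURS
pick_partition() {
    local nodes=$1 hours=$2
    if [ "$nodes" -le 6 ] && [ "$hours" -le 5 ]; then
        echo express        # up to 6 nodes, up to 5 hours
    elif [ "$nodes" -le 24 ] && [ "$hours" -le 24 ]; then
        echo production     # up to 24 nodes, up to 24 hours
    else
        echo "no partition fits $nodes nodes for $hours hours" >&2
        return 1
    fi
}

pick_partition 2 4     # prints: express
pick_partition 8 12    # prints: production
```

The result can then be passed to the submission commands, for example with --partition="$(pick_partition 2 4)".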
The current CASTOR resources available via SLURM are summarized here :
| Node | Cores | Memory | Memory/Core | GPU model | GPUs | Compute capability | GPU type | GPU cores | GPU memory | GPU frequency | GPU memory type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| castor1 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor2 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor3 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor4 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor5 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor6 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor7 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor8 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor9 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor10 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor11 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor12 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor13 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor14 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor15 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor16 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor17 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor18 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor19 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor20 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor21 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor22 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor23 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor24 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor25 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor26 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor27 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor28 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor29 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor30 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor31 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
| castor32 | 12 | 24 GB | 2 GB | M2090 | 2 | 2.0 | fermi | 512 | 6 GB | 1.3 GHz | GDDR5 |
GENERIC RESOURCE SCHEDULING (Gres)
SLURM is configured on CASTOR to support the Gres scheduling for Graphics Processing Units (GPUs). See here for further technical details :
Two SLURM parameters can be used when submitting jobs in the system (via srun, sbatch or salloc), in order to select the appropriate CASTOR node with the desired GPU resources :
- --gres=gpu:N, where N=[1|2]
- --constraint=[fermi|m2090|nvidia]
WARNING: The --gres=gpu:N parameter MUST be specified on the command line or inside a SLURM batch script (e.g. #SBATCH --gres=gpu:2) whenever access to one or both NVIDIA M2090 GPUs is required! If it is omitted, SLURM grants the job access to none of the GPUs on the node!
The --gres parameter controls how many GPUs per node a job needs, while the --constraint parameter selects the GPU type. Only one value may be given to --constraint, for example :
- --constraint=m2090
There is no need to pick one specific GPU type (m2090) as the value of the --constraint parameter, since every CASTOR node carries exactly the same kind of GPU (NVIDIA FERMI M2090).
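To make the effect of the --gres request concrete, a job can check how many GPUs it was actually granted. The sketch below assumes that SLURM exports the CUDA_VISIBLE_DEVICES variable for jobs that request --gres=gpu:N, as its GPU Gres support normally does; the function name count_granted_gpus is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the GPUs visible to the current job.
# Assumes SLURM sets CUDA_VISIBLE_DEVICES (e.g. "0,1") when --gres=gpu:N is requested.
count_granted_gpus() {
    local devices=${CUDA_VISIBLE_DEVICES:-}
    case "$devices" in
        # Unset, empty, or the "NoDevFiles" placeholder that some SLURM
        # versions export when no GPU was requested: no GPU is usable.
        ''|NoDevFiles) echo 0; return ;;
    esac
    # Split the comma-separated device list and count its entries.
    local IFS=,
    set -- $devices
    echo $#
}

echo "GPUs granted to this job: $(count_granted_gpus)"
```

Placed near the top of a batch script, a check like this makes a missing --gres directive visible in the job output before the application starts.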
SLURM on CASTOR
To become familiar with SLURM on CASTOR, we suggest following the steps below.
See the table above for the list of resources available on each node.
1. Connect to the front-end server castor.cscs.ch via SSH, going first through the CSCS main entry point ELA (ela.cscs.ch) :
ssh -Y ela.cscs.ch
ssh -Y castor.cscs.ch
2. Get familiar with the end-user SLURM commands :
squeue -a -l -u $USER
sinfo -a -l
scontrol -a show partition
scontrol show nodes
scontrol -a show job
sview &
3. Below are 3 examples of typical SLURM job submission templates. Try them with your own application code and don't hesitate to report problems to help@cscs.ch, putting the keywords "CASTOR: SLURM problems" in the subject line.
Simple SLURM interactive job request :
For your interactive session you need a CASTOR compute node with an NVIDIA FERMI M2090 graphics card, at least 16 GB of main memory and 1 CPU for 4 hours :
srun -N 1 -n 1 --partition=production --gres=gpu:1 --constraint=m2090 --mem=16g --time=04:00:00 --pty /bin/bash
Alternatively, if you need interactive access to multiple compute nodes (-N 2), you can reserve them with an allocation and then connect to the allocated nodes via SSH :
salloc -N 2 -n 1 --partition=production --gres=gpu:2 --constraint=m2090 --mem=16g --time=04:00:00
scontrol show job JOBID
ssh -Y castorX.login.cscs.ch
ssh -Y castorY.login.cscs.ch
Once your work is finished, release the node allocation by exiting the shell from which salloc was launched.
Simple SLURM batch script for serial jobs :
You need to submit a SLURM serial job requesting 1 CPU, 12 GB of main memory and 2 GPUs of type FERMI for 6 hours and 30 minutes :
simple-serial.sh :
#!/bin/bash
#SBATCH --job-name="SLURM TEST JOB"
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --constraint=fermi
#SBATCH --mem=12g
#SBATCH --time=06:30:00
#SBATCH --partition=production
#SBATCH --account=$GROUP
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@MAIL.DOMAIN
#SBATCH --output=/users/$USER/slurm-OUT.log
#SBATCH --error=/users/$USER/slurm-ERR.log
#======START=============================================
. /etc/profile.d/modules.bash
# Load your required module files here
module load cuda/5.0
echo "On which node your job has been scheduled :"
echo $SLURM_JOB_NODELIST
echo "Print current shell limits :"
ulimit -a
echo "Now run your serial tasks..."
# PLACE YOUR SERIAL CODE HERE
#======END===============================================
Now submit your SLURM serial job :
sbatch simple-serial.sh
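Remarks: replace $GROUP with your primary group, $USER with your username, and provide your e-mail address for the --mail-user directive. These placeholders must be replaced by hand, because SLURM does not expand shell variables inside #SBATCH lines.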
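As an alternative to editing the account and mail address into the script, the same options can be given on the sbatch command line, where they take precedence over the #SBATCH directives. The sketch below only prints the command it would run; the function name build_submit_command and the MAIL_DOMAIN variable are hypothetical, and it assumes that your SLURM account matches the primary group reported by id -gn.

```shell
#!/bin/bash
# Hypothetical helper: print the sbatch command line for a given script,
# filling in the account from the caller's primary group (id -gn).
# MAIL_DOMAIN is a placeholder variable; set it to your real mail domain.
build_submit_command() {
    local script=$1
    local group user
    group=$(id -gn)     # primary group name
    user=$(id -un)      # login name
    echo "sbatch --account=$group --mail-user=$user@${MAIL_DOMAIN:-MAIL.DOMAIN} $script"
}

# Print the command for inspection before running it by hand.
build_submit_command simple-serial.sh
```

The --output and --error paths in the script still need your username written out explicitly.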
Simple SLURM batch script for parallel-MPI jobs :
You need to submit a parallel MPI SLURM job requesting 4 FERMI nodes and 48 MPI tasks in total, with 12 MPI tasks per node, each associated with exactly one CPU core, and a total of 32 GB of memory, i.e. 8 GB per node (about 0.67 GB per MPI task) :
simple-parallel.sh :
#!/bin/bash
#SBATCH --job-name="SLURM TEST JOB"
#SBATCH --nodes=4
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=12
#SBATCH --gres=gpu:2
#SBATCH --constraint=fermi
#SBATCH --mem=8g
#SBATCH --time=01:30:00
#SBATCH --partition=production
#SBATCH --account=$GROUP
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@MAIL.DOMAIN
#SBATCH --output=/users/$USER/slurm-OUT.log
#SBATCH --error=/users/$USER/slurm-ERR.log
#======START===============================
. /etc/profile.d/modules.bash
module load cuda/5.0
module load mvapich2/1.9
echo "On which nodes it executes :"
echo $SLURM_JOB_NODELIST
echo "Which MPI Implementation is used :"
which mpiexec.hydra
mpiexec.hydra -info
echo "Print current shell limits :"
ulimit -a
echo "Now run the MPI tasks..."
mpiexec.hydra -np 48 -ppn 12 /users/$USER/MY_MPI_EXEC
#======END=================================
Now submit your SLURM Parallel-MPI job :
sbatch simple-parallel.sh
Remarks: replace $GROUP with your primary group membership, $USER with your username and provide your e-mail address for the --mail-user directive
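Because --mem specifies the memory required per node, the memory figures of a multi-node job follow from simple arithmetic. The sketch below works them out for the script above; the function name mem_summary is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: summarize the memory requested by a job where
# --mem gives the memory per node (in GB).
# Usage: mem_summary NODES TASKS_PER_NODE MEM_PER_NODE_GB
mem_summary() {
    local nodes=$1 tasks_per_node=$2 mem_gb=$3
    local total_gb=$(( nodes * mem_gb ))
    # Integer division, so the per-task figure is rounded down to whole MB.
    local per_task_mb=$(( mem_gb * 1024 / tasks_per_node ))
    echo "total=${total_gb}GB per_node=${mem_gb}GB per_task=${per_task_mb}MB"
}

mem_summary 4 12 8   # prints: total=32GB per_node=8GB per_task=682MB
```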
Running MPI Jobs on CASTOR
Let's suppose that you wish to run a parallel program (test.exe) with 24 MPI tasks for 30 minutes. With SLURM you can simply mention the number of tasks:
#!/bin/bash
#SBATCH --ntasks=24
#SBATCH --time=00:30:00
mpirun ./test.exe
On Castor you use the "sbatch" submission command.
The “mpirun” command will be selected from the MPI module that is loaded. For example, if MVAPICH2 is loaded, then its process manager will be called. Likewise, for other MPI installations, users need to ensure that MPI library usage is consistent.
Alternatively, you can simply request the total number of nodes you require and then allocate your tasks and/or threads to them using suitable mpirun parameters. So, for 24 MPI processes, with one MPI process per core on Castor, you need to request 2 nodes (2 nodes x 12 cores per node). Your SLURM job submission file could look like:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12
#SBATCH --time=00:30:00
mpirun -np 24 ./test.exe
You can change the number of MPI tasks placed on each compute node with the --ntasks-per-node directive, adjusting the total given to the mpirun "-np" option to match. The following is a job submission file for a job with 12 MPI processes in which each compute node runs only 6 MPI processes.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=6
#SBATCH --time=00:30:00
mpirun -np 12 ./test.exe
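The node counts used in these examples follow from dividing the number of MPI tasks by the number of tasks placed on each node and rounding up. A minimal sketch of that calculation, with a hypothetical function name nodes_needed and a default of 12 tasks per node (one per core on a CASTOR node):

```shell
#!/bin/bash
# Hypothetical helper: number of nodes needed for a given number of MPI tasks,
# rounding up. The tasks-per-node argument defaults to 12, the core count of
# a CASTOR node.
# Usage: nodes_needed TASKS [TASKS_PER_NODE]
nodes_needed() {
    local tasks=$1 per_node=${2:-12}
    echo $(( (tasks + per_node - 1) / per_node ))
}

nodes_needed 24      # prints: 2
nodes_needed 12 6    # prints: 2
nodes_needed 25      # prints: 3
```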
Running MPI + OpenMP Jobs on CASTOR
It is possible to run hybrid MPI+OpenMP parallel jobs on Castor. Let's assume that your parallel program requires 4 MPI tasks and that each MPI task can spawn 6 OpenMP threads and you would like to run the job on two nodes. An appropriate job submission file could be the following:
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6
#SBATCH --time=00:30:00
export OMP_NUM_THREADS=6
export MV2_ENABLE_AFFINITY=0
mpirun -np 4 ./test.exe
Two MPI processes run on each compute node, and each process spawns 6 OpenMP threads. For the MVAPICH2 implementation, MV2_ENABLE_AFFINITY=0 disables core affinity so that the OS scheduler can place the OpenMP threads on the available cores.
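In a hybrid job the thread count can be taken from SLURM rather than written twice. The sketch below assumes the SLURM_CPUS_PER_TASK variable, which SLURM sets when --cpus-per-task is given, and adds a hypothetical check, fits_on_node, that the tasks and threads requested per node do not exceed the 12 cores of a CASTOR node.

```shell
#!/bin/bash
# Take the OpenMP thread count from SLURM instead of hard-coding it.
# SLURM_CPUS_PER_TASK is set by SLURM when --cpus-per-task is given;
# fall back to 1 thread when it is not set (e.g. outside a job).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Hypothetical check: do TASKS_PER_NODE x CPUS_PER_TASK fit on one node?
# The third argument defaults to 12, the core count of a CASTOR node.
fits_on_node() {
    local tasks_per_node=$1 cpus_per_task=$2 cores=${3:-12}
    [ $(( tasks_per_node * cpus_per_task )) -le "$cores" ]
}

if fits_on_node 2 6; then
    echo "2 tasks x 6 threads fit on a 12-core node"
fi
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```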
Running an Interactive Multi-node Job
There are two ways to accomplish this task.
1) Using MVAPICH2, you need to link your code with the PMI library provided by SLURM
> mpicc -o test.exe test.c -L/apps/castor/slurm/default/lib -lpmi
Then you can ask for a certain number of resources
> salloc --ntasks=24
and you can use the “srun” command to run a job on the allocated resources
> srun -n16 --mpi=none ./test.exe
2) You can ask for resources using “salloc”, but then you need to find out which nodes have been assigned to you, for example with "scontrol show job JOBID" as in the interactive job request above, which shows the nodes assigned to your job. You can then ssh to those nodes and run your jobs by specifying the list of nodes assigned to your interactive session.
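One way to obtain the node names in a form that is easy to loop over is sketched below. It assumes it runs inside the shell started by salloc, where SLURM sets SLURM_JOB_NODELIST, and uses "scontrol show hostnames" to expand the compact node list into one hostname per line; the function name list_allocated_nodes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print one hostname per line for the current allocation.
# Relies on SLURM_JOB_NODELIST, which SLURM sets inside the salloc shell,
# and on "scontrol show hostnames", which expands a compact node list.
list_allocated_nodes() {
    if [ -z "${SLURM_JOB_NODELIST:-}" ]; then
        echo "not inside a SLURM allocation" >&2
        return 1
    fi
    scontrol show hostnames "$SLURM_JOB_NODELIST"
}

# Example: report the nodes of the allocation, if there is one.
if nodes=$(list_allocated_nodes 2>/dev/null); then
    echo "Allocated nodes:"
    echo "$nodes"
else
    echo "No allocation found; run this inside the shell started by salloc."
fi
```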

