
Castor - IBM iDataPlex

CASTOR SLURM Batch Queuing System 

 
Since Tuesday 20 December 2011, Castor, the CSCS IBM iDataPlex Research & Development GPU cluster, has been running the SLURM batch queuing system. SLURM ("Simple Linux Utility for Resource Management") is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
 
CASTOR cluster resources can be accessed in batch or interactive mode, using the SLURM client job submission commands:

    - sbatch : submits a batch script to SLURM
    - salloc : obtains a SLURM job allocation (a set of nodes)
    - srun   : runs a parallel job
 
The following SLURM end-user commands let you check the status of the configured partitions (queues) and the availability of resources (nodes), control and monitor jobs, watch the resource usage of jobs, and get detailed job information:
 
    - squeue  : view information about jobs located in the SLURM scheduling queue 
 
    - sinfo   : view information about SLURM nodes and partitions 
 
    - smap    : graphically view information about SLURM jobs, partitions 
 
    - sview   : graphical user interface to view and modify SLURM state 
 
    - scancel : used to signal or cancel jobs or job steps 
 
    - scontrol : view SLURM current configuration 
 
    - sacct   : displays user job accounting data 
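
As a quick sanity check after logging in, the short sketch below reports which of the commands listed above are visible on your PATH. The function name missing_commands is made up for this example; it relies only on the standard shell builtin "command -v" and makes no assumption about where SLURM is installed.

```shell
#!/bin/bash
# Print every command from the argument list that is NOT found on the PATH.
# (missing_commands is a hypothetical helper name, not a SLURM tool.)
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# The client commands named in this section.
missing=$(missing_commands sbatch salloc srun squeue sinfo smap sview scancel scontrol sacct)
if [ -z "$missing" ]; then
    echo "All SLURM client commands were found."
else
    echo "Not found on this machine:" $missing
fi
```

If some commands are reported as not found, you are probably not on the CASTOR frontend, or the SLURM environment has not been set up in your shell.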
 

Further SLURM documentation

Detailed SLURM documentation can be found on the official SLURM web site:

           SLURM official documentation

CASTOR SLURM current configuration

The current batch queuing system configuration offers 3 partitions (queues):

    - production : parallel, 24 nodes, max. walltime 24 hours, open to all users
    - express    : parallel, 6 nodes, max. walltime 5 hours, open to all users
    - special    : parallel, 2 visualization nodes, max. walltime 24 hours, opened on user request
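
The limits above can be turned into a small helper that suggests a partition for a given request. This is only a sketch: the function name pick_partition is hypothetical, preferring "express" for small, short jobs is an assumption rather than a stated CSCS policy, and the "special" partition is left out because it is only opened on request.

```shell
#!/bin/bash
# pick_partition NODES MINUTES   (hypothetical helper, not a SLURM command)
# Prints "express" or "production" according to the limits listed above,
# or returns 1 if the request fits neither partition.
pick_partition() {
    nodes=$1
    minutes=$2
    if [ "$nodes" -le 6 ] && [ "$minutes" -le 300 ]; then
        echo express          # up to 6 nodes, up to 5 hours
    elif [ "$nodes" -le 24 ] && [ "$minutes" -le 1440 ]; then
        echo production       # up to 24 nodes, up to 24 hours
    else
        echo "request exceeds the limits of both partitions" >&2
        return 1
    fi
}

pick_partition 2 240     # 2 nodes for 4 hours
pick_partition 8 90      # 8 nodes for 1.5 hours
```

The result can be passed on to a submission command, for example sbatch --partition="$(pick_partition 2 240)" followed by your script name.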

The current CASTOR resources available via SLURM are summarized here : 

CASTOR SLURM RESOURCES

Legend: Core = CPU cores per node; Mem = main memory per node; M/C = memory per core; GPU = GPU model; GPU# = GPUs per node; Cap = CUDA compute capability; GPU-t = GPU architecture; GPU-c = cores per GPU; GPU-m = memory per GPU; GPU-f = GPU clock frequency; GPU-mt = GPU memory type.

Node      Core  Mem    M/C   GPU    GPU#  Cap  GPU-t  GPU-c  GPU-m  GPU-f    GPU-mt
castor1   12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor2   12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor3   12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor4   12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor5   12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor6   12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor7   12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor8   12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor9   12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor10  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor11  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor12  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor13  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor14  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor15  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor16  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor17  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor18  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor19  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor20  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor21  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor22  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor23  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor24  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor25  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor26  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor27  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor28  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor29  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor30  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor31  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5
castor32  12    24 GB  2 GB  M2090  2     2.0  fermi  512    6 GB   1.3 GHz  GDDR5

 

 

GENERIC RESOURCE SCHEDULING (Gres)

SLURM is configured on CASTOR to support the Gres scheduling for Graphics Processing Units (GPUs). See here for further technical details : 


Two SLURM parameters can be used when submitting jobs to the system (via srun, sbatch or salloc) in order to select a CASTOR node with the desired GPU resources:

  • --gres=gpu:N, where N=[1|2]
  • --constraint=[fermi|m2090|nvidia]

WARNING: The --gres=gpu:N parameter MUST be specified on the command line or inside a SLURM batch script (e.g. #SBATCH --gres=gpu:2) whenever access to one or both NVIDIA M2090 GPUs is required! If it is omitted, SLURM will not grant access to any of the GPUs on the node!

The --gres parameter controls how many GPUs per node a job needs, while the --constraint parameter selects the GPU type. Only one value may be given to --constraint, for example:

  • --constraint=m2090

In practice there is no need to request one particular GPU type (m2090) with the --constraint parameter, since every CASTOR node carries exactly one kind of GPU (NVIDIA FERMI M2090).
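
Because a missing or malformed --gres option leaves a job without GPU access, it can help to build the two options in one place and validate them first. The sketch below does that; gpu_options is a hypothetical helper name, and the accepted values are simply the ones listed in this section (1 or 2 GPUs; fermi, m2090 or nvidia).

```shell
#!/bin/bash
# gpu_options N [FEATURE]   (hypothetical helper, not part of SLURM)
# Prints the --gres/--constraint pair for N GPUs per node (1 or 2 on CASTOR).
# FEATURE defaults to "fermi".
gpu_options() {
    count=$1
    feature=${2:-fermi}
    case "$count" in
        1|2) ;;
        *) echo "GPU count must be 1 or 2, got '$count'" >&2; return 1 ;;
    esac
    case "$feature" in
        fermi|m2090|nvidia) ;;
        *) echo "unknown constraint '$feature'" >&2; return 1 ;;
    esac
    echo "--gres=gpu:$count --constraint=$feature"
}

gpu_options 2
gpu_options 1 m2090
```

The printed string can be spliced into a submission command, for instance srun -N 1 -n 1 $(gpu_options 1) --pty /bin/bash.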

SLURM on CASTOR

To become familiar with SLURM on CASTOR, we suggest that you follow the steps below.

See the table above for the resources available on each node.

1. Connect to the frontend server castor.cscs.ch via SSH, by first connecting to the CSCS main entry point ELA (ela.cscs.ch):

           ssh -Y ela.cscs.ch

           ssh -Y castor.cscs.ch

2. Start becoming familiar with the end-user SLURM commands :

           squeue -a -l -u $USER

           sinfo -a -l 

           scontrol -a show partition

           scontrol show nodes

           scontrol -a show job

           sview &

3. Below are 3 examples of typical SLURM job submission templates. Try them with your own application code and do not hesitate to report problems to help@cscs.ch, putting the keywords "CASTOR: SLURM problems" in the subject line.

Simple SLURM interactive job request:

For a 4-hour interactive session you need a CASTOR compute node with at least 16 GB of main memory, 1 CPU and an NVIDIA FERMI M2090 graphics card:

          srun -N 1 -n 1 --partition=production --gres=gpu:1 --constraint=m2090 --mem=16g --time=04:00:00 --pty /bin/bash 

Alternatively, if you need interactive access to multiple compute nodes (-N 2), you can reserve them with an allocation and then connect to the allocated nodes via SSH:

          salloc -N 2 -n 1 --partition=production --gres=gpu:2 --constraint=m2090 --mem=16g --time=04:00:00

          scontrol show job JOBID

          ssh -Y castorX.login.cscs.ch

          ssh -Y castorY.login.cscs.ch

Once your work is finished, release the node allocation by exiting the shell from which salloc was launched.

Simple SLURM batch script for serial jobs : 

You need to submit a serial SLURM job requesting 1 CPU, 12 GB of main memory and 2 GPUs of type FERMI for 6 hours and 30 minutes:

 simple-serial.sh :

#!/bin/bash
#SBATCH --job-name="SLURM TEST JOB"
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --constraint=fermi
#SBATCH --mem=12g
#SBATCH --time=06:30:00
#SBATCH --partition=production
#SBATCH --account=$GROUP
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@MAIL.DOMAIN
#SBATCH --output=/users/$USER/slurm-OUT.log
#SBATCH --error=/users/$USER/slurm-ERR.log
#======START=============================================

# Initialise the modules environment
. /etc/profile.d/modules.bash
# Load your required module files here
module load cuda/5.0
echo "Node on which your job has been scheduled :"
echo "$SLURM_JOB_NODELIST"
echo "Print current shell limits :"
ulimit -a
echo "Now run your serial tasks..."
# PLACE YOUR SERIAL CODE HERE
#======END===============================================

Now submit your SLURM serial job:

              sbatch simple-serial.sh 

Remarks: replace $GROUP with your primary group membership, $USER with your username and provide your e-mail address for the --mail-user directive
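
The placeholders have to be replaced by hand because sbatch reads the #SBATCH lines itself, as comments, so shell variables written inside them are not expanded. One way to avoid editing mistakes is to generate those lines with the values already filled in. The sketch below is a minimal illustration: sbatch_header is a hypothetical helper, and the group, user name and e-mail address in the example call are made-up values.

```shell
#!/bin/bash
# sbatch_header ACCOUNT USER EMAIL   (hypothetical helper)
# Prints the #SBATCH lines that contain placeholders in the template above,
# with the real values substituted.
sbatch_header() {
    account=$1
    user=$2
    email=$3
    cat <<EOF
#SBATCH --account=$account
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$email
#SBATCH --output=/users/$user/slurm-OUT.log
#SBATCH --error=/users/$user/slurm-ERR.log
EOF
}

# Example with made-up values; "id -gn" prints your primary group name.
sbatch_header mygroup jdoe jdoe@example.org
```

The output can be pasted into the batch script in place of the corresponding template lines.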

Simple SLURM batch script for parallel-MPI jobs : 

You need to submit a parallel MPI SLURM job requesting 4 FERMI nodes and 48 MPI tasks in total, with 12 MPI tasks per node, each bound to exactly one CPU core, and a total of 32 GB of memory, i.e. 8 GB per node (about 0.67 GB per MPI task):
 
simple-parallel.sh : 

#!/bin/bash
#SBATCH --job-name="SLURM TEST JOB"
#SBATCH --nodes=4
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=12
#SBATCH --gres=gpu:2
#SBATCH --constraint=fermi
#SBATCH --mem=8g
#SBATCH --time=01:30:00
#SBATCH --partition=production
#SBATCH --account=$GROUP
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@MAIL.DOMAIN
#SBATCH --output=/users/$USER/slurm-OUT.log
#SBATCH --error=/users/$USER/slurm-ERR.log
#======START===============================
. /etc/profile.d/modules.bash

module load cuda/5.0
module load mvapich2/1.9
echo "On which nodes it executes :"
echo $SLURM_JOB_NODELIST
echo "Which MPI Implementation is used :"
which mpiexec.hydra
mpiexec.hydra -info
echo "Print current shell limits :"
ulimit -a
echo "Now run the MPI tasks..."
mpiexec.hydra -np 48 -ppn 12 /users/$USER/MY_MPI_EXEC
#======END=================================
 
Now submit your SLURM Parallel-MPI job : 
 
    sbatch simple-parallel.sh 

Remarks: replace $GROUP with your primary group membership, $USER with your username and provide your e-mail address for the --mail-user directive
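
Before submitting a job like this, it is worth checking that the numbers in the request agree with each other. The sketch below checks that nodes times tasks per node equals the total task count, and that no more than 12 tasks land on a 12-core node. It then prints the share of the per-node memory available to each task. check_layout is a hypothetical helper, and 8192 MB is used as the equivalent of --mem=8g.

```shell
#!/bin/bash
# check_layout NODES TASKS_PER_NODE TOTAL_TASKS MEM_PER_NODE_MB
# (hypothetical helper) Verifies that the request is self-consistent and
# fits a 12-core CASTOR node, then prints the memory share of each task.
check_layout() {
    nodes=$1; per_node=$2; total=$3; mem_mb=$4
    if [ $((nodes * per_node)) -ne "$total" ]; then
        echo "nodes x tasks-per-node = $((nodes * per_node)), but ntasks = $total" >&2
        return 1
    fi
    if [ "$per_node" -gt 12 ]; then
        echo "more than 12 tasks per node requested" >&2
        return 1
    fi
    # Integer division: the result is rounded down to whole MB.
    echo "$((mem_mb / per_node)) MB per task"
}

check_layout 4 12 48 8192    # prints: 682 MB per task
```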

Running MPI Jobs on CASTOR

Suppose that you wish to run a parallel program (test.exe) with 24 MPI tasks for 30 minutes. With SLURM you can simply specify the number of tasks:

#!/bin/bash
#SBATCH --ntasks=24
#SBATCH --time=00:30:00 
 

mpirun ./test.exe

On Castor you use the "sbatch" submission command. 

The “mpirun” command will be selected from the MPI module that is loaded. For example, if MVAPICH2 is loaded, then its process manager will be called. Likewise, for other MPI installations, users need to ensure that MPI library usage is consistent. 

Alternatively, you can simply request the total number of nodes you require and then allocate your tasks and/or threads to them using suitable mpirun parameters. So, for 24 MPI processes, with one MPI process per core on Castor, you need to request 2 nodes (2 nodes x 12 cores per node). Your SLURM job submission file could look like: 

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12
#SBATCH --time=00:30:00 
 

mpirun -np 24 ./test.exe

You can change the number of MPI tasks placed on each compute node with the "--ntasks-per-node" directive, adjusting the total passed to the mpirun "-np" option to match. The following job submission file runs 12 MPI processes in total, with only 6 MPI processes on each compute node.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=6
#SBATCH --time=00:30:00 
 

mpirun -np 12 ./test.exe
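
The node counts used in these examples follow from a ceiling division of the total number of MPI tasks by the number of tasks per node. A minimal sketch of that arithmetic is given below; nodes_for_tasks is a hypothetical helper, and the default of 12 tasks per node is taken from the 12 cores of a CASTOR node.

```shell
#!/bin/bash
# nodes_for_tasks TOTAL_TASKS [TASKS_PER_NODE]   (hypothetical helper)
# Prints the number of nodes needed; TASKS_PER_NODE defaults to 12,
# the core count of a CASTOR node.
nodes_for_tasks() {
    total=$1
    per_node=${2:-12}
    echo $(( (total + per_node - 1) / per_node ))   # ceiling division
}

nodes_for_tasks 24       # prints 2
nodes_for_tasks 12 6     # prints 2
nodes_for_tasks 25       # prints 3
```

When the task count is not a multiple of the tasks per node, as in the last call, one extra node is needed and the last node is only partly filled.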

Running MPI + OpenMP Jobs on CASTOR

It is possible to run hybrid MPI+OpenMP parallel jobs on Castor. Let's assume that your parallel program requires 4 MPI tasks and that each MPI task can spawn 6 OpenMP threads and you would like to run the job on two nodes. An appropriate job submission file could be the following: 

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6
#SBATCH --time=00:30:00 
 

export OMP_NUM_THREADS=6
export MV2_ENABLE_AFFINITY=0

mpirun -np 4 ./test.exe

Two MPI processes run on each compute node, and each process spawns 6 OpenMP threads. For the MVAPICH2 implementation, MV2_ENABLE_AFFINITY=0 disables core affinity, so that the OS scheduler can place the threads of each MPI process on the available cores.
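
For hybrid jobs, the product of MPI tasks per node and threads per task should not exceed the 12 cores of a node, and the thread count should match the --cpus-per-task request. The sketch below shows one way to check the first and derive the second. Both function names are hypothetical. SLURM_CPUS_PER_TASK is an environment variable that SLURM sets inside a job when --cpus-per-task is given, and the fallback to 1 thread when it is absent is an assumption of this example.

```shell
#!/bin/bash
# fits_on_node TASKS_PER_NODE CPUS_PER_TASK   (hypothetical helper)
# Succeeds when the tasks and their threads fit in the 12 cores of a node.
fits_on_node() {
    [ $(( $1 * $2 )) -le 12 ]
}

# set_omp_threads   (hypothetical helper)
# Takes the thread count from SLURM_CPUS_PER_TASK if it is set,
# and falls back to 1 thread otherwise.
set_omp_threads() {
    OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    export OMP_NUM_THREADS
}

fits_on_node 2 6 && echo "2 tasks x 6 threads fit on one node"
set_omp_threads
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Deriving OMP_NUM_THREADS from the SLURM variable keeps the thread count and the --cpus-per-task directive from drifting apart when one of them is edited.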

Running an Interactive Multi-node Job

There are two ways to accomplish this task.

1) Using MVAPICH2, you need to link your code with the PMI library provided by SLURM:

> mpicc -o test.exe test.c -L/apps/castor/slurm/default/lib -lpmi

Then you can request the resources you need:

> salloc --ntasks=24

and use the “srun” command to run a job on the allocated resources:

> srun -n16 --mpi=none ./test.exe

2) You can ask for resources using “salloc” but then you need to find out which nodes have been assigned to you:

> scontrol show job [ID]

which will show the nodes assigned to your job. You can ssh to the nodes and run your jobs by specifying the list of nodes that are assigned for your interactive session.
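
SLURM usually reports the assigned nodes in a compressed form such as castor[3-5,9]. On a system with SLURM installed, "scontrol show hostnames" expands such a list. The sketch below is a plain-shell illustration of the same idea, useful for seeing how the notation works. expand_nodelist is a hypothetical helper that handles only a single prefix[...] group without zero padding, which is an assumption matching the castorN names used on this page.

```shell
#!/bin/bash
# expand_nodelist "castor[3-5,9]"   (hypothetical helper)
# Prints one hostname per line. Handles a single prefix[...] group with
# comma-separated numbers and ranges, and no zero padding.
expand_nodelist() {
    list=$1
    case "$list" in
        *\[*) ;;
        *) echo "$list" | tr ',' '\n'; return 0 ;;   # plain names: split on commas
    esac
    prefix=${list%%\[*}      # text before the first "["
    body=${list#*\[}         # text after the first "["
    body=${body%\]}          # drop the closing "]"
    for part in $(echo "$body" | tr ',' ' '); do
        case "$part" in
            *-*)
                i=${part%-*}
                last=${part#*-}
                while [ "$i" -le "$last" ]; do
                    echo "$prefix$i"
                    i=$((i + 1))
                done ;;
            *) echo "$prefix$part" ;;
        esac
    done
}

expand_nodelist "castor[3-5,9]"
```

The resulting names can then be used one by one for the SSH connections described above.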