
Eiger - Dalco SM System

EIGER SLURM Batch Queuing System 

 
Since Wednesday, 07 September 2011, the EIGER Visualization, Research & Development cluster has been running the SLURM batch queuing system. SLURM ("Simple Linux Utility for Resource Management") is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
 
Access to EIGER cluster resources remains possible in batch or interactive mode, using the SLURM job submission commands:

    - sbatch : submit a batch script to SLURM
    - salloc : obtain a SLURM job allocation (a set of nodes)
    - srun   : run a parallel job
 
The following SLURM end-user commands let you check the status of the configured partitions (queues) and the availability of resources (nodes), control and monitor jobs, track the resource usage of jobs, and get detailed job information:
 
    - squeue  : view information about jobs located in the SLURM scheduling queue 
 
    - sinfo   : view information about SLURM nodes and partitions 
 
    - smap    : graphically view information about SLURM jobs, partitions 
 
    - sview   : graphical user interface to view and modify SLURM state 
 
    - scancel : used to signal or cancel jobs or job steps 
 
    - scontrol : view and modify SLURM configuration and state
 
    - sacct   : displays user job accounting data 
 

Further SLURM documentation

Detailed SLURM documentation can be found on the official SLURM web site:

           SLURM official documentation

EIGER SLURM current configuration

The current batch queuing system configuration offers two distinct partitions (queues):

    stdMem     parallel   13 nodes
    largeMem   parallel    8 nodes

 

The current EIGER resources available via SLURM are summarized here : 

EIGER SLURM RESOURCES

Node      Core  Mem    M/C   GPU      GPU#  Cap  GPU-t    GPU-c  GPU-m    GPU-f     GPU-mt
eiger200  12    24 GB  2 GB  GTX 285  1     1.3  geforce  240    2 GB     1.48 GHz  GDDR3
eiger201  12    24 GB  2 GB  GTX 285  1     1.3  geforce  240    2 GB     1.48 GHz  GDDR3
eiger202  12    24 GB  2 GB  GTX 285  1     1.3  geforce  240    2 GB     1.48 GHz  GDDR3
eiger203  12    24 GB  2 GB  GTX 285  1     1.3  geforce  240    2 GB     1.48 GHz  GDDR3
eiger204  12    24 GB  2 GB  GTX 285  1     1.3  geforce  240    2 GB     1.48 GHz  GDDR3
eiger205  12    24 GB  2 GB  GTX 480  1     2.0  fermi    480    1536 MB  1.4 GHz   GDDR5
eiger206  12    24 GB  2 GB  GTX 480  1     2.0  fermi    480    1536 MB  1.4 GHz   GDDR5
eiger207  24    48 GB  2 GB  M2050    2     2.0  fermi    448    2.6 GB   1.15 GHz  GDDR5
eiger208  24    48 GB  2 GB  M2050    2     2.0  fermi    448    2.6 GB   1.15 GHz  GDDR5
eiger209  24    48 GB  2 GB  C2070    2     2.0  fermi    448    5.4 GB   1.15 GHz  GDDR5
eiger210  24    48 GB  2 GB  C2070    2     2.0  fermi    448    5.4 GB   1.15 GHz  GDDR5
eiger220  12    48 GB  4 GB  GTX 285  1     1.3  geforce  240    2 GB     1.48 GHz  GDDR3
eiger221  12    48 GB  4 GB  GTX 285  1     1.3  geforce  240    2 GB     1.48 GHz  GDDR3
eiger222  12    48 GB  4 GB  GTX 285  1     1.3  geforce  240    2 GB     1.48 GHz  GDDR3
eiger223  12    48 GB  4 GB  GTX 285  1     1.3  geforce  240    2 GB     1.48 GHz  GDDR3
eiger240  12    24 GB  2 GB  S1070    2     1.3  tesla    240    4 GB     1.3 GHz   GDDR3
eiger241  12    24 GB  2 GB  S1070    2     1.3  tesla    240    4 GB     1.3 GHz   GDDR3
eiger242  12    24 GB  2 GB  C2070    2     2.0  fermi    448    5.4 GB   1.15 GHz  GDDR5
eiger243  12    24 GB  2 GB  C2070    2     2.0  fermi    448    5.4 GB   1.15 GHz  GDDR5
eiger180  12    24 GB  2 GB  GTX 480  1     2.0  fermi    480    1536 MB  1.4 GHz   GDDR5
eiger181  12    24 GB  2 GB  GTX 480  1     2.0  fermi    480    1536 MB  1.4 GHz   GDDR5

 

 

GENERIC RESOURCE SCHEDULING (Gres) 

SLURM is configured on EIGER to support generic resource (Gres) scheduling for Graphics Processing Units (GPUs). See here for further technical details :


Two SLURM parameters can be used when submitting jobs (via srun, sbatch or salloc) in order to select an EIGER node with the desired GPU resources:

  • --gres=gpu:N, where N=[1|2]
  • --constraint=[geforce|gtx285|fermi|gtx480|m2050|c2070|s1070|tesla|m2090|nvidia]

The --gres parameter controls how many GPUs per node a job needs, while the --constraint parameter selects the GPU type. Provide only one value to --constraint in order to target the desired GPU type.

  • --constraint=nvidia

We recommend setting --constraint to the exact GPU type you need (s1070, c2070, m2050, m2090, gtx480, gtx285, hd6970), since your GPU-side application code (kernel) may be better optimized for a particular GPU type and can take advantage of its specific architecture and features (NVIDIA, Fermi vs non-Fermi GPUs, etc.).
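As a minimal sketch of how the two parameters combine, the shell function below assembles the GPU options and rejects values outside the lists given above. The name gres_opts is hypothetical and is not part of SLURM; the accepted values are copied from the --gres and --constraint lists in this section.

```shell
#!/bin/bash
# Hypothetical helper (not a SLURM command): build the GPU options
# from a GPU count and a GPU type.
gres_opts() {
    local count="$1" gpu_type="$2"
    # --gres=gpu:N accepts N=1 or N=2 on EIGER
    case "$count" in
        1|2) ;;
        *) echo "GPU count must be 1 or 2" >&2; return 1 ;;
    esac
    # GPU type must be one of the documented --constraint values
    case "$gpu_type" in
        geforce|gtx285|fermi|gtx480|m2050|c2070|s1070|tesla|m2090|nvidia) ;;
        *) echo "unknown GPU type: $gpu_type" >&2; return 1 ;;
    esac
    echo "--gres=gpu:${count} --constraint=${gpu_type}"
}

# Example: print (not run) an srun command line for two C2070 GPUs on one node
echo "srun -N 1 -n 1 $(gres_opts 2 c2070) --pty /bin/bash"
```

The function only checks the spelling of the request; whether a node with that combination exists still depends on the resource table above.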

SLURM on EIGER

See the table above for the resources available on each node.

To get familiar with SLURM on EIGER, we suggest following the steps below:

1. Connect to the frontend server eiger.cscs.ch via SSH, by first connecting to the CSCS main entry point ELA (ela.cscs.ch):

           ssh -Y ela.cscs.ch

           ssh -Y eiger.cscs.ch

2. Get familiar with the end-user SLURM commands:

           squeue -a -l -u $USER

           sinfo -a -l 

           scontrol -a show partition

           scontrol show nodes

           scontrol -a show job

           sview &

3. Below are 3 typical SLURM job submission templates. Try them with your own application code and do not hesitate to report problems to help@cscs.ch, putting the keywords "EIGER: SLURM migration problems" in the subject line.

Simple SLURM interactive job request : 

For an interactive session you need an EIGER compute node with an NVIDIA GeForce GTX 285 graphics card, at least 23 GB of main memory and 1 CPU for 4 hours:

          srun -N 1 -n 1 --gres=gpu:1 --constraint=gtx285 --mem=23g --time=04:00:00 --pty /bin/bash 

Alternatively, if you need interactive access to multiple compute nodes (-N 2), reserve them with an allocation and then connect to the allocated nodes via SSH. Note that --gres is counted per node and the GTX 285 nodes carry one GPU each, so request --gres=gpu:1 and one task per node:

          salloc -N 2 -n 2 --gres=gpu:1 --constraint=gtx285 --mem=23g --time=04:00:00

          scontrol show job JOBID

          ssh -Y eigerXXX.login.cscs.ch

          ssh -Y eigerYYY.login.cscs.ch

Once your work is finished, release the node allocation by exiting the shell from which salloc was launched.

Simple SLURM batch script for serial jobs : 

You need to submit a serial SLURM job requesting 1 CPU, 12 GB of main memory and 1 Fermi GPU for 6 hours and 30 minutes:

 simple-serial.sh :

#!/bin/bash
#SBATCH --job-name="SLURM TEST JOB"
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --constraint=fermi
#SBATCH --mem=12g
#SBATCH --time=06:30:00
#SBATCH --partition=stdMem
#SBATCH --account=$GROUP
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@MAIL.DOMAIN
#SBATCH --output=/users/$USER/slurm-OUT.log
#SBATCH --error=/users/$USER/slurm-ERR.log
#======START=============================================
. /etc/profile.d/modules.bash
# Load your required module files here
module load MODULE_TO_BE_LOADED
echo "Node on which your job has been scheduled :"
echo "$SLURM_JOB_NODELIST"
echo "Current shell limits :"
ulimit -a
echo "Now run your serial tasks..."
# PLACE YOUR SERIAL CODE HERE
#======END===============================================

Now submit your SLURM serial job :

              sbatch simple-serial.sh 

Remarks: replace $GROUP with your primary group, $USER with your username, and provide your e-mail address for the --mail-user directive.
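The replacement has to be done by hand because sbatch reads the #SBATCH lines as plain comments and does not expand shell variables in them. One possible way to avoid manual editing is to generate the script from a here-document, as in the sketch below. The variable names JOB_USER, JOB_ACCOUNT and JOB_MAIL are illustrative assumptions, and MAIL.DOMAIN remains a placeholder to be replaced.

```shell
#!/bin/bash
# Sketch: generate simple-serial.sh with the placeholders filled in.
# JOB_USER, JOB_ACCOUNT and JOB_MAIL are illustrative names, not SLURM variables.
JOB_USER="$(id -un)"                  # your username
JOB_ACCOUNT="$(id -gn)"               # your primary group
JOB_MAIL="${JOB_USER}@MAIL.DOMAIN"    # replace MAIL.DOMAIN with your mail domain

# Unquoted here-document: ${...} is expanded now, \$ is kept for job run time.
cat > simple-serial.sh <<EOF
#!/bin/bash
#SBATCH --job-name="SLURM TEST JOB"
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --constraint=fermi
#SBATCH --mem=12g
#SBATCH --time=06:30:00
#SBATCH --partition=stdMem
#SBATCH --account=${JOB_ACCOUNT}
#SBATCH --mail-type=ALL
#SBATCH --mail-user=${JOB_MAIL}
#SBATCH --output=/users/${JOB_USER}/slurm-OUT.log
#SBATCH --error=/users/${JOB_USER}/slurm-ERR.log
echo "\$SLURM_JOB_NODELIST"
# PLACE YOUR SERIAL CODE HERE
EOF

echo "Wrote simple-serial.sh for account ${JOB_ACCOUNT}"
```

The generated file can then be submitted with sbatch as shown above.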

Simple SLURM batch script for parallel-MPI jobs : 

You need to submit a parallel MPI SLURM job requesting 4 Fermi nodes and 48 MPI tasks in total, with 12 MPI tasks per node, each bound to exactly one CPU core, and 8 GB of memory per node (32 GB in total, roughly 0.67 GB per MPI task):
 
simple-parallel.sh : 

#!/bin/bash
#SBATCH --job-name="SLURM TEST JOB"
#SBATCH --nodes=4
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=12
#SBATCH --gres=gpu:1
#SBATCH --constraint=fermi
#SBATCH --mem=8g
#SBATCH --time=01:30:00
#SBATCH --partition=stdMem
#SBATCH --account=$GROUP
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@MAIL.DOMAIN
#SBATCH --output=/users/$USER/slurm-OUT.log
#SBATCH --error=/users/$USER/slurm-ERR.log
#======START===============================
. /etc/profile.d/modules.bash
module load mvapich2/1.7
echo "On which nodes it executes :"
echo "$SLURM_JOB_NODELIST"
echo "Which MPI Implementation is used :"
which mpiexec.hydra
mpiexec.hydra -info
echo "Print current shell limits :"
ulimit -a
echo "Now run the MPI tasks..."
mpiexec.hydra -rmk slurm -ppn 12 /users/$USER/MY_MPI_EXEC
#======END=================================
 
Now submit your SLURM Parallel-MPI job : 
 
    sbatch simple-parallel.sh 

Remarks: replace $GROUP with your primary group, $USER with your username, and provide your e-mail address for the --mail-user directive.
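Since --mem sets the memory required per node, the memory figures of simple-parallel.sh can be checked with a little shell arithmetic. The sketch below is only an illustration; the function names mem_total_gb and mem_per_task_mb are made up for this example.

```shell
#!/bin/bash
# Illustrative helpers (hypothetical names, not SLURM commands).

# Total memory of the job in GB: per-node --mem value times number of nodes.
mem_total_gb() {
    local mem_per_node_gb="$1" nodes="$2"
    echo $(( mem_per_node_gb * nodes ))
}

# Memory share of one MPI task in MB (integer division, rounded down).
mem_per_task_mb() {
    local mem_per_node_gb="$1" tasks_per_node="$2"
    echo $(( mem_per_node_gb * 1024 / tasks_per_node ))
}

# Values from simple-parallel.sh: --mem=8g, --nodes=4, --ntasks-per-node=12
echo "total memory : $(mem_total_gb 8 4) GB"
echo "per MPI task : $(mem_per_task_mb 8 12) MB"
```

With the values of the script this gives 32 GB in total and 682 MB per MPI task, so raise --mem if each task needs more.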

Running MPI Jobs on Eiger

Suppose you wish to run a parallel program (test.exe) with 24 MPI tasks for 30 minutes. With SLURM you can simply specify the number of tasks:

#!/bin/bash
#SBATCH --ntasks=24
#SBATCH --time=00:30:00 
 

mpirun ./test.exe

On Eiger, submit the script with the "sbatch" command.

The “mpirun” command will be selected from the MPI module that is loaded. For example, if MVAPICH2 is loaded, then its process manager will be called. Likewise, for other MPI installations, users need to ensure that MPI library usage is consistent. 

Alternatively, you can simply request the total number of nodes you require and then allocate your tasks and/or threads to them using suitable mpirun parameters. So, for 24 MPI processes, with one MPI process per core on Eiger, you need to request 2 nodes (2 nodes x 12 cores per node—note some Eiger nodes have 24 cores). Your SLURM job submission file could look like: 

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12
#SBATCH --time=00:30:00 
 

mpirun -np 24 ./test.exe

You can change the number of MPI tasks placed on each compute node with --ntasks-per-node, while the mpirun "-np" option gives the total number of MPI processes. The following job submission file runs a 12-process MPI job with only 6 MPI processes on each compute node.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=6
#SBATCH --time=00:30:00 
 

mpirun -np 12 ./test.exe
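The node counts used in the two job files above follow from a ceiling division of the total number of MPI tasks by the tasks per node. A minimal sketch, where nodes_needed is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: smallest number of nodes that can hold ntasks
# MPI processes when each node runs at most tasks_per_node of them.
nodes_needed() {
    local ntasks="$1" tasks_per_node="$2"
    echo $(( (ntasks + tasks_per_node - 1) / tasks_per_node ))   # ceiling division
}

echo "#SBATCH --nodes=$(nodes_needed 24 12)"   # 24 tasks, 12 per node
echo "#SBATCH --nodes=$(nodes_needed 12 6)"    # 12 tasks,  6 per node
```

Both cases yield 2 nodes, matching the --nodes=2 lines in the job files.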

Running MPI + OpenMP Jobs on Eiger

It is possible to run hybrid MPI+OpenMP parallel jobs on Eiger. Let's assume that your parallel program requires 4 MPI tasks and that each MPI task can spawn 6 OpenMP threads and you would like to run the job on two nodes. An appropriate job submission file could be the following: 

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6
#SBATCH --time=00:30:00 
 

export OMP_NUM_THREADS=6
export MV2_ENABLE_AFFINITY=0

mpirun -np 4 ./test.exe

Two MPI processes run on each compute node, and each process spawns 6 OpenMP threads. With MVAPICH2, MV2_ENABLE_AFFINITY=0 disables core affinity so that the OS scheduler can place the OpenMP threads on the available cores.
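Before submitting a hybrid job it can help to check that tasks per node times threads per task does not exceed the cores of a node. The sketch below does this for the values of the job file above; NODE_CORES=12 is an assumption taken from the 12-core rows of the resource table, and cores_used and fits_on_node are hypothetical helper names. It also shows one way to derive OMP_NUM_THREADS from SLURM_CPUS_PER_TASK, which SLURM sets inside a job when --cpus-per-task is given.

```shell
#!/bin/bash
# Values from the hybrid job file above; NODE_CORES is an assumption
# (a 12-core EIGER node, see the resource table).
TASKS_PER_NODE=2
CPUS_PER_TASK=6
NODE_CORES=12

# Cores used on one node = MPI tasks per node * OpenMP threads per task.
cores_used() {
    echo $(( $1 * $2 ))
}

# Succeeds when the requested layout fits on one node.
fits_on_node() {
    [ "$(cores_used "$1" "$2")" -le "$3" ]
}

if fits_on_node "$TASKS_PER_NODE" "$CPUS_PER_TASK" "$NODE_CORES"; then
    echo "layout uses $(cores_used "$TASKS_PER_NODE" "$CPUS_PER_TASK") of $NODE_CORES cores"
else
    echo "layout oversubscribes the node" >&2
fi

# Inside a job, take the thread count from SLURM; outside SLURM,
# fall back to the value defined above.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-$CPUS_PER_TASK}"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```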

Running an Interactive Multi-node Job

There are two ways to accomplish this task.

1) Using MVAPICH2, you need to link your code with the PMI library provided by SLURM

> mpicc -o test.exe test.c -L/apps/eiger/slurm/2.3.0/lib -lpmi

Then you can request the resources you need

> salloc --ntasks=24

and use the “srun” command to run a job on the allocated resources

> srun -n16 --mpi=none ./test.exe

2) You can request resources using “salloc” and then find out which nodes have been assigned to you:

> scontrol show job [ID]

This shows the nodes assigned to your job. You can then ssh to these nodes and run your jobs there, specifying the list of assigned nodes for your interactive session.
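For the second approach, the node names can also be listed from inside the shell started by salloc, where SLURM sets SLURM_JOB_NODELIST. The sketch below uses "scontrol show hostnames" to expand the compact node list into one hostname per line and prints, without running them, the matching ssh commands. The function name list_job_nodes is hypothetical, and the .login.cscs.ch suffix is assumed from the ssh example given earlier in this document.

```shell
#!/bin/bash
# Print one hostname per line for the current allocation.
# Uses "scontrol show hostnames", which expands a compact node list.
list_job_nodes() {
    if [ -z "${SLURM_JOB_NODELIST:-}" ]; then
        echo "SLURM_JOB_NODELIST is not set (not inside an allocation?)" >&2
        return 1
    fi
    if ! command -v scontrol >/dev/null 2>&1; then
        echo "scontrol not found" >&2
        return 1
    fi
    scontrol show hostnames "$SLURM_JOB_NODELIST"
}

# Print (not run) the ssh command for each allocated node.
# The .login.cscs.ch suffix follows the ssh example given earlier.
if nodes="$(list_job_nodes)"; then
    for node in $nodes; do
        echo "ssh -Y ${node}.login.cscs.ch"
    done
fi
```

Outside an allocation the function only reports that no node list is available.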


 

PBS Pro versus SLURM Migration help table :

The table below gives a one-to-one mapping between PBS Pro and SLURM directives, useful when migrating existing PBS Pro batch scripts to SLURM.

EIGER PBS Pro versus SLURM migration help table

PBS Professional V 10.2          SLURM V 2.3.0 pre-release 5
#PBS -N job_name                 #SBATCH --job-name="My Job Name"
#PBS -l select=1                 #SBATCH --nodes=1
#PBS -l mpiprocs=4               #SBATCH --ntasks=4
#PBS -l ncpus=4                  #SBATCH --cpus-per-task=1
#PBS                             #SBATCH --ntasks-per-node=4
#PBS -l host=eigerXXX            #SBATCH --nodelist=eigerXXX
#PBS -M $USER@MAIL.DOMAIN        #SBATCH --mail-user=$USER@MAIL.DOMAIN
#PBS -m abe                      #SBATCH --mail-type=ALL
#PBS -l walltime=HH:MM:SS        #SBATCH --time=HH:MM:SS
#PBS -l min_walltime=HH:MM:SS    #SBATCH --time-min=HH:MM:SS
#PBS -l cput=HH:MM:SS            #SBATCH
#PBS -q <queue>                  #SBATCH --partition=<queue>
#PBS -q <reservation-queue>      #SBATCH --reservation=<reservation_name>
#PBS -W group_list=a_group       #SBATCH --account=a_group
#PBS -o /path/to/out             #SBATCH --output=/path/to/out (supports %j for jobid)
#PBS -e /path/to/err             #SBATCH --error=/path/to/err (supports %j for jobid)
#PBS -V                          NOT REQUIRED ANYMORE
#PBS -l gpu=fermi:ngpus=2        #SBATCH --gres=gpu:2
                                 #SBATCH --constraint=fermi
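As a rough illustration of how the one-to-one rows of the table could be applied mechanically, the sketch below rewrites a few PBS directives with sed. The function name pbs_to_slurm is hypothetical, only the simple single-option rows are covered, and the -l resource lines (select, mpiprocs, ncpus, gpu) are left untouched because they need manual review.

```shell
#!/bin/bash
# Hypothetical converter for the simple rows of the migration table.
# Reads a PBS Pro script on stdin, writes the converted script to stdout.
pbs_to_slurm() {
    sed -E \
        -e 's/^#PBS -N (.*)$/#SBATCH --job-name="\1"/' \
        -e 's/^#PBS -q (.*)$/#SBATCH --partition=\1/' \
        -e 's/^#PBS -l walltime=(.*)$/#SBATCH --time=\1/' \
        -e 's/^#PBS -M (.*)$/#SBATCH --mail-user=\1/' \
        -e 's/^#PBS -m abe$/#SBATCH --mail-type=ALL/' \
        -e 's/^#PBS -o (.*)$/#SBATCH --output=\1/' \
        -e 's/^#PBS -e (.*)$/#SBATCH --error=\1/' \
        -e 's/^#PBS -W group_list=(.*)$/#SBATCH --account=\1/' \
        -e '/^#PBS -V$/d'
}

# Example input; the directive values are placeholders.
pbs_to_slurm <<'EOF'
#!/bin/bash
#PBS -N job_name
#PBS -q stdMem
#PBS -l walltime=00:30:00
#PBS -l select=1
#PBS -V
EOF
```

Lines that remain as #PBS in the output mark the directives that still have to be translated by hand using the table.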