Running jobs

All CSCS systems use the SLURM batch system for the submission, control and management of user jobs.

SLURM (https://computing.llnl.gov/linux/slurm/) provides a rich set of features for organizing your workload and an extensive array of tools for managing your resource usage. In normal interaction with the batch system, however, you only need to know three basic commands:

sbatch    -  submit a batch script
squeue    -  check the status of jobs on the system
scancel   -  delete one of your jobs from the queue
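A typical round trip with these three commands might look like the following sketch; the script name and the job ID 123456 are illustrative, not fixed values:

```shell
# Submit the job script; on success sbatch prints the job ID.
sbatch hello_world_mpi.sbatch
# Submitted batch job 123456

# List only your own jobs rather than the whole queue.
squeue -u $USER

# Cancel the job by its ID if it is no longer needed.
scancel 123456
```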

An appropriate SLURM job submission file for your parallel job is a shell script with a set of directives at the beginning. A directive is a line starting with the string "#SBATCH" (as a note for former PBS batch system users, this is the SLURM equivalent of "#PBS"). A suitable batch script is then submitted to the batch system using the 'sbatch' command.

Please note that on most systems your account is charged for node usage, whether you submit batch jobs with sbatch or use interactive sessions with salloc (see the corresponding man pages for more details). Interactive allocations (salloc sessions) are for debugging only and have a maximum wall time of one hour.
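For a short interactive debugging session, the resource requests can be given directly on the salloc command line instead of in a batch script; the node count, time limit and program name below are illustrative:

```shell
# Request 1 node for 30 minutes interactively
# (debugging only; maximum wall time is one hour).
salloc --nodes=1 --time=00:30:00

# Once the allocation is granted, launch the program on it:
srun ./hello_world_mpi.x

# Release the allocation when finished:
exit
```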

I/O Redirection and Job Naming 

A basic batch script can be written using just the '--ntasks' and '--time' directives, but extra directives will give you more control over how your job is run.

Output and Error

By default both the standard output and the standard error of your script are written to a single file of the form slurm-<XXXX>.out, where <XXXX> is the SLURM batch job number of your job. This file is placed in the directory from which you submitted the job.

Note that with SLURM the output file is created as soon as your job starts running, and the output from your job is placed in this file as the job progresses, so that you can monitor your job's progress. Therefore do not delete this file while your job is running, or you will lose your output. Keep in mind, however, that SLURM buffers writes to the output file by default, so in practice the output of your job will not appear in the file immediately. If you want to override this behaviour, pass the '-u' or '--unbuffered' option to the srun command; the output will then appear in the file as soon as it is produced.
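Inside a batch script, disabling the buffering would look like the following line (the program name is illustrative):

```shell
# Write each line of program output to the job's output file
# as soon as it is produced, instead of buffering it.
srun --unbuffered ./hello_world_mpi.x
```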

As the default names for the output and error files are not very meaningful, you may wish to change them; to do so, use the "--output" and "--error" directives in the batch script that you submit with sbatch.


#!/bin/bash -l
#
#SBATCH --job-name="hello_world_mpi"
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --output=hello_world_mpi.%j.o
#SBATCH --error=hello_world_mpi.%j.e

#======START======
module load slurm
srun ./hello_world_mpi.x
#=======END=======
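
Assuming the script above is saved as hello_world_mpi.sbatch, submitting it and finding its output files would look like this; the job ID 123456 is illustrative:

```shell
sbatch hello_world_mpi.sbatch
# Submitted batch job 123456

# When the job runs, the "%j" in the --output and --error
# directives is replaced by the job ID, so the output goes to
# hello_world_mpi.123456.o and errors to hello_world_mpi.123456.e
# in the submission directory.
ls hello_world_mpi.123456.*
```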