SLURM (https://computing.llnl.gov/linux/slurm/) provides a rich set of features for organizing your workload and an extensive array of tools for managing your resource usage. However, in normal interaction with the batch system you only need to know three basic commands:
sbatch - submit a batch script
squeue - check the status of jobs on the system
scancel - delete one of your jobs from the queue
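A typical sequence of these commands might look like the following (the job ID 12345 is illustrative):
bash$ sbatch batch.txt
Submitted batch job 12345
bash$ squeue -u $USER
bash$ scancel 12345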
To construct a suitable SLURM job submission file for your parallel job, you write a shell script and place a set of directives at the beginning. SLURM directives are issued by starting a line with the string "#SBATCH" (as a note for former PBS batch system users, this is the SLURM equivalent of "#PBS"). The batch script is then submitted to the batch system using the 'sbatch' command.
In the following sections, we present some simple examples that highlight the most common job types executed on CSCS machines. Remember that your account is charged per node used, whether you submit batch jobs with sbatch or run interactive sessions with salloc; the latter is discouraged unless needed for debugging.
I/O Redirection and Job Naming
A basic batch script can be written using just the '--ntasks' and '--time' directives, but extra directives will give you more control over how your job is run.
Output and Error
By default the output of your script is written to a file of the form slurm-<XXXX>.out, where <XXXX> is the SLURM batch job number of your job, and the error is written to a file called slurm-<XXXX>.err; both files are placed in the directory from which you launched the job.
Note that with SLURM the output file is created as soon as your job starts running, and the output from your job is placed in this file as the job progresses, so that you can directly monitor your job's progress! Therefore do not delete this file while your job is running, or you will lose your output!
As the default names for the output and error files are not very meaningful, you might wish to change them, and in this case you can use the "--output" and "--error" directives in your batch script as follows:
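(the filename myjob below is an illustrative placeholder)
#SBATCH --output=myjob.out
#SBATCH --error=myjob.err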
If you wish to put output and error into the same file then you would simply repeat the filename for both output and error as here:
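(with the same illustrative filename as before)
#SBATCH --output=myjob.out
#SBATCH --error=myjob.out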
You can use the identifier '%j' in the output and error file names to insert the job ID, as an extra help if you still wish to get separate output files for each separate run, for example:
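(again with illustrative filenames)
#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err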
You can give your job a name that will appear in the output of the squeue command by issuing the '--job-name' directive.
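For example (the job name is an illustrative placeholder):
#SBATCH --job-name=my_simulation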
Submitting Jobs from Multiple Projects
By default SLURM will charge your jobs against the project associated with your group ID, but if your account is able to run from multiple production projects at CSCS you need to use the directive '#SBATCH --account=' to tell SLURM which project to charge the job against:
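(where myproject is an illustrative placeholder for one of your project names)
#SBATCH --account=myproject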
If you change the project account under which your job is to run then not only will the job be charged against that project, but also the priority of your job will be calculated according to that project's usage rather than that of your default project. If you are not allowed to submit a job against a particular project then SLURM will reject your job at submission time.
For full information about the sbatch command, please consult the man page with 'man sbatch' directly from the command line on the machine that you are using.
With SLURM you can simply request the total number of tasks that you require and the time that you need for the job to run. Let's suppose that you wish to run a parallel program (test.exe) with 128 MPI tasks for 30 minutes. You use #SBATCH --ntasks=128 to request 128 MPI processes, and #SBATCH --time="00:30:00" to ask for 30 minutes, and then you simply repeat the 128-task request on the aprun line. The simplest form of a SLURM job submission file batch.txt would look like the following:
#!/bin/bash -l
#SBATCH --ntasks=128
#SBATCH --time=00:30:00
aprun -n 128 ./test.exe
The flag -l at the beginning allows you to call the module command within the script, in case you need it. You would then submit this job into the batch system using the command sbatch and the name of your file as in:
bash$ sbatch batch.txt
Note that your environment variables are exported to the batch job by default, and your script is automatically run in the same directory from which you submitted the job.
Experienced PBSpro users should note this difference: it is equivalent to using the '#PBS -V' directive in the batch script header to export your environment, combined with placing the command 'cd $PBS_O_WORKDIR' as the first executable line of your script in order to move to the directory from which you submitted the job.
Within the script you launch your parallel executable using the Cray application launch command aprun. By default, SLURM will allocate MPI tasks so that compute nodes are fully populated, but sometimes it might be necessary to reduce the number of MPI tasks given to each reserved compute node (you might be using a threaded application, or your MPI tasks might require more physical memory, for example). You can change the number of MPI tasks given to each compute node with the aprun -N option, but this must be accompanied in the batch script header by a #SBATCH --ntasks-per-node= directive. The following is a job submission file for a job with 128 MPI processes where each compute node will now run only 8 MPI processes:
#!/bin/bash -l
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00
aprun -n 128 -N 8 ./test.exe
Note: let us remind you once more that you are charged for the number of nodes used, not the number of cores used, because nodes cannot be shared by more than one job simultaneously. This job would therefore reserve 16 nodes and be charged more than a job using all the cores available per node: for instance, on Monte Rosa each node has 32 cores, so using only 8 cores per node you will be charged 4 times more than using all 32.
Note that all of the directives can be given directly on the sbatch command line instead of as #SBATCH lines in the batch script: any directive given on the command line overrides the corresponding one in the batch script. So even if you specify a time limit in your batch script, you can always issue a --time= directive directly on the sbatch submission line, and the new value will override what was specified in the batch script. In the case of the previous job submission script, you could submit it with an increased time request of 1 hour by issuing the command:
bash$ sbatch --time=01:00:00 batch.txt
It is possible to run hybrid MPI+OpenMP parallel jobs by suitable modifications to your batch script. Let's assume that your parallel program requires 64 MPI tasks and that each MPI task can spawn 8 OpenMP threads. We need to modify the batch script header to specify how many MPI tasks we wish to launch per node with the directive #SBATCH --ntasks-per-node, and how many cores per process to reserve for OpenMP threads with the directive #SBATCH --cpus-per-task:
#!/bin/bash -l
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH --time=00:30:00
export OMP_NUM_THREADS=8
aprun -n 64 -N 2 -d 8 ./test.exe
In this case, 2 MPI processes run on each compute node, and each of those processes spawns 8 OpenMP threads (OMP_NUM_THREADS sets the thread count, while the -d flag reserves one core per thread). Note that the batch script and aprun line would be the same for other threaded programming models such as POSIX threads.
Sometimes it is not possible to submit all of your jobs at the same time due to interdependencies between jobs, for example:
- When a batch job requires input data produced by a previous job
- When a post-processing job needs to wait for the main number-crunching job to finish
- When a full run needs much more time than the maximum allowed job run time, so that a set of jobs has to be submitted to run sequentially one after the other
In these cases, generating a sequence of jobs with dependencies allows you to submit all of your workload at the same time and leave the batch system to schedule the work in the correct order.
In SLURM a job that depends on a previous job can be submitted by issuing the directive #SBATCH --dependency= followed by a qualifier specifying under what circumstances the job becomes eligible to run. As you need to know the job ID of the job on which you depend, you will typically not place the directive inside your batch script, but instead issue it directly on the command line as an extra option to sbatch. The two most important qualifiers are:
afterok: this job can begin execution when the other job has finished successfully
afterany: this job can begin execution when the other job finishes, even if it had problems
In these circumstances, a job is deemed to have finished successfully if the last command of the batch script finishes cleanly (with an exit code of 0).
Dependency Example - Post-processing
In this simple example we have a main number-crunching job, and a post-processing job, with the post-processing job needing the main job to finish cleanly in order to be sure that the correct data has been generated.
Our batch script for the main job (batch.txt) might look like this:
#!/bin/bash -l
#SBATCH --ntasks=2048
#SBATCH --time=01:00:00
aprun -n 2048 ./number_cruncher
and we submit this job with the command sbatch batch.txt
We get the output "Submitted batch job 11901", where 11901 is the job id returned by SLURM for the main job. Our post-processing job has the following script (post.txt):
#!/bin/bash -l
#SBATCH --ntasks=256
#SBATCH --time=00:30:00
aprun -n 256 ./post_processor
and we would now submit this job with a dependency on the previous job 11901 as follows: sbatch --dependency=afterok:11901 post.txt, getting "Submitted batch job 11905" as output on the screen. The post-processing job 11905 will now sit in the queue and only become eligible to run after the main job has finished: please note that until that time its priority will not be calculated and it will not age in the queue. We issued the '--dependency' option directly on the sbatch command line rather than placing it in our script because it requires the job ID of the previous job, which changes with every run, so the line would not be reusable in a batch script.
Thread depth and memory usage
For threaded applications, use the --cpus-per-task/-c parameter of sbatch/salloc to set the thread depth per task. This corresponds to the aprun -d parameter (a note for former PBS users: mppdepth is the PBS equivalent). Please note that SLURM does not set the OMP_NUM_THREADS environment variable, so your script must set it itself.
Hence, if an application spawns 4 threads, an example script would look like:
#SBATCH --comment="illustrate the use of thread depth and OMP_NUM_THREADS"
aprun -n 3 -d $OMP_NUM_THREADS ./my_exe
Specifying the number of tasks per node
SLURM uses the same default as the Cray Application Level Placement Scheduler (ALPS), assigning each task to a single core/CPU. In order to make more resources available per task, you can reduce the number of processing elements per node with the --ntasks-per-node option of sbatch/salloc, equivalent to the aprun -N parameter (mppnppn in PBS). This is particularly necessary when tasks require more memory than the per-CPU default.
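For example, a minimal sketch (the task counts are illustrative) that reserves 4 nodes for 16 tasks by placing only 4 tasks on each node:
#!/bin/bash -l
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=4
aprun -n 16 -N 4 ./my_exe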
Specifying the memory per-task
In Cray terminology, a task is also called a "processing element" (PE); hence below we refer to the per-task memory and "per-PE" memory interchangeably. The per-PE memory requested through the batch system corresponds to the aprun -m parameter.
The default memory available per task, with the implicit assumption that you run 1 task per core/CPU, is the per-CPU share node_memory/number_of_cores. If nothing else is specified, the --mem option to sbatch/salloc can only be used to reduce the per-PE memory below the per-CPU share. This is also the only way that the --mem-per-cpu option can be applied (moreover, the --mem-per-cpu option is ignored if the user forgets to specify --ntasks/-n). Thus, the preferred way of specifying memory is the more general --mem option.
In order to increase the per-PE memory settable via the --mem option, you need to make more per-task resources available using the --ntasks-per-node option to sbatch/salloc. This allows --mem to request up to node_memory/ntasks_per_node MB. When --ntasks-per-node is 1, the entire node memory may be requested by the application. Setting --ntasks-per-node to the number of cores per node yields the default per-CPU share as the minimum value. For all cases in between these extremes, set --mem=per_task_memory and --ntasks-per-node=floor(node_memory/per_task_memory) whenever per_task_memory needs to be larger than the per-CPU share.
1) Requesting less than the per-CPU share:
#SBATCH --comment="limiting the memory to 256MB per task"
2) An application needs at least 2 GB per task when running with 32 tasks:
#SBATCH --comment="rosa, 32 tasks each requiring 2 GB per task"