
Usage Policy

Tödi is a development system and therefore has a special usage policy:

Prime Time 8AM - 6PM: no production runs are allowed. Only jobs with a wall clock time of at most 1 hour can be submitted, for development purposes.

Non Prime Time 6PM - 8AM + Weekends: free for production runs.
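
The policy above can be turned into a small check before submitting a long job. The following is only a sketch: the function name is_prime_time is hypothetical, and it assumes that prime time covers Monday to Friday from 8:00 up to, but not including, 18:00 local time.

```shell
# is_prime_time HOUR DOW
#   HOUR: hour of the day, 0-23 (as printed by `date +%H`)
#   DOW:  day of the week, 1=Monday .. 7=Sunday (as printed by `date +%u`)
# Prints "yes" during prime time and "no" otherwise.
# Assumption: prime time is 08:00-17:59 on weekdays only.
is_prime_time() {
    hour=${1#0}        # drop one leading zero, e.g. "08" becomes "8"
    hour=${hour:-0}    # a bare "0" becomes empty above, so restore it
    dow=$2
    if [ "$dow" -ge 6 ]; then
        echo no        # Saturday or Sunday
    elif [ "$hour" -ge 8 ] && [ "$hour" -lt 18 ]; then
        echo yes
    else
        echo no
    fi
}

# Check the current local time
is_prime_time "$(date +%H)" "$(date +%u)"
```

If the function prints "yes", only development jobs within the prime time wall clock limit should be submitted.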

Running a SLURM job on Tödi

Tödi uses SLURM for the submission, monitoring and control of parallel jobs. Parallel executables must be run with the aprun command, and SLURM batch scripts must be submitted with the sbatch command from your $SCRATCH folder only (users are NOT supposed to run jobs from $HOME or $PROJECT because of their low performance). No special SLURM or aprun settings are required to run on the GPUs. A simple SLURM job submission script looks like the following:

#!/bin/bash -l
#SBATCH --nodes=8
#SBATCH --ntasks=128
#SBATCH --time=00:30:00
aprun -B ./test.exe

The -l flag on the first line lets you call the module command within the script, should you need it. The aprun flag -B asks ALPS to apply the settings requested in the batch reservation: here the scheduler allocates 8 nodes with 16 tasks per node (128 tasks in total). Below is a more extended example batch script (mpicuda.sbatch), which you can submit with the following command:
sbatch mpicuda.sbatch

#!/bin/bash -l
#SBATCH --job-name="test" 
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1 
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00 
#======START=============================== 
echo "On which nodes it executes" 
echo $SLURM_JOB_NODELIST 
echo "Now run the MPI tasks..." 
aprun -B ./mpicuda.x
#======END=================================
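
In both scripts above, the number of tasks per node follows from the --ntasks and --nodes values. A minimal sketch of that arithmetic, assuming the tasks are spread evenly over the nodes (the helper name tasks_per_node is hypothetical):

```shell
# tasks_per_node NTASKS NODES
# Prints how many tasks land on each node when NTASKS are spread evenly
# over NODES. Returns non-zero if the tasks cannot be divided evenly.
tasks_per_node() {
    if [ $(( $1 % $2 )) -ne 0 ]; then
        echo "ntasks ($1) is not a multiple of nodes ($2)" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

tasks_per_node 128 8   # simple script: prints 16
tasks_per_node 2 2     # extended script: prints 1
```

The second result matches the --ntasks-per-node=1 line of the extended script.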

Note that multiple MPI tasks on a node can share a GPU. To enable this on Tödi, set the CRAY_CUDA_PROXY environment variable to 1. This feature can improve performance for some codes, especially codes in which each MPI task cannot keep its GPU continuously busy.
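
As an illustration of GPU sharing, the sketch below writes a batch script that places several MPI tasks on one node and sets CRAY_CUDA_PROXY before calling aprun. The file name mpicuda_shared.sbatch and the task count of 4 are assumptions chosen for the example, and the sketch assumes that aprun passes the environment of the batch script on to the MPI tasks.

```shell
# Write a hypothetical batch script in which 4 MPI tasks share the GPU
# of a single node. Nothing is submitted here; the file is only created.
cat > mpicuda_shared.sbatch <<'EOF'
#!/bin/bash -l
#SBATCH --job-name="shared"
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:05:00
# Allow several MPI tasks on the node to use the same GPU
export CRAY_CUDA_PROXY=1
aprun -B ./mpicuda.x
EOF

echo "Wrote mpicuda_shared.sbatch"
```

The generated file would then be submitted from $SCRATCH with sbatch mpicuda_shared.sbatch, in the same way as the scripts above.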

The following simple code tests the basic functionality of an MPI job with CUDA: it runs one CUDA kernel from each of the n MPI processes and prints the results to standard output.

Main (mpicuda_main.c):

#include <stdio.h>
#include <unistd.h> /* gethostname() */
#include <mpi.h>
#ifndef DEVS_PER_NODE
#define DEVS_PER_NODE 2 /* Devices per node */
#endif

/* Implemented in mpicuda.cu */
void run_gpu_kernel(void);
void get_gpu_info(char *, int);
void set_gpu(int);

int main(int argc, char *argv[]){
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Get this host's name */
    char hostName[256] = "";
    gethostname(hostName, sizeof(hostName));

    /* Given the MPI rank, set the device number and get its info */
    char gpu_str[256] = "";
    int dev = rank % DEVS_PER_NODE;
    set_gpu(dev);
    get_gpu_info(gpu_str, dev);

    printf("MPI Rank %d on %s using Device %d (%s)\n",
           rank, hostName, dev, gpu_str);

    run_gpu_kernel();
    MPI_Finalize();

    return 0;
}
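
The device selection in main is the remainder of the MPI rank divided by DEVS_PER_NODE. The short sketch below tabulates that mapping for four ranks and the default value of 2 devices per node; the helper name device_for_rank is hypothetical.

```shell
# device_for_rank RANK DEVS_PER_NODE
# Mirrors "dev = rank % DEVS_PER_NODE" from mpicuda_main.c.
device_for_rank() {
    echo $(( $1 % $2 ))
}

for rank in 0 1 2 3; do
    echo "rank $rank -> device $(device_for_rank "$rank" 2)"
done
# prints:
# rank 0 -> device 0
# rank 1 -> device 1
# rank 2 -> device 0
# rank 3 -> device 1
```

With DEVS_PER_NODE set to 1 every rank maps to device 0.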

 

Functions (mpicuda.cu):

#include <stdio.h>
#include <string.h> /* strcpy() */
#define SIZE 12
/* Add two arrays on the device */
__global__ void gpu_kernel(int *d_a1, int *d_a2, int *d_a3, int N) {
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)
        d_a3[idx] = d_a1[idx] + d_a2[idx];
}
extern "C"
void run_gpu_kernel(){
    int i;
    int a1[SIZE], a2[SIZE], a3[SIZE]; /* Host arrays */
    int *d_a1, *d_a2, *d_a3; /* Device arrays */

    /* Initialize the host input arrays */
    for(i = 0; i < SIZE; i++) {
        a1[i] = i;
        a2[i] = 100*i;
    }

    /* Allocate the device arrays and copy data to the device */
    cudaMalloc((void**) &d_a1, sizeof(int)*SIZE);
    cudaMalloc((void**) &d_a2, sizeof(int)*SIZE);
    cudaMalloc((void**) &d_a3, sizeof(int)*SIZE);
    cudaMemcpy(d_a1, a1, sizeof(int)*SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(d_a2, a2, sizeof(int)*SIZE, cudaMemcpyHostToDevice);

    /* Launch 3 blocks of 4 threads: one thread per array element */
    gpu_kernel<<<3, 4>>>(d_a1, d_a2, d_a3, SIZE);

    /* Copy the result back to the host */
    cudaMemcpy(a3, d_a3, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);

    /* Print the output array */
    for(i = 0; i < SIZE; i++) {
        printf("%d ", a3[i]);
    }
    printf("\n");

    cudaFree(d_a1);
    cudaFree(d_a2);
    cudaFree(d_a3);
}
extern "C"
void get_gpu_info(char *gpu_string, int dev){
    struct cudaDeviceProp dprop;
    cudaGetDeviceProperties(&dprop, dev);
    strcpy(gpu_string, dprop.name);
}
extern "C"
void set_gpu(int dev){
    cudaSetDevice(dev);
}
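
The kernel launch uses 3 blocks of 4 threads, which gives exactly one thread for each of the SIZE = 12 array elements. For other sizes the block count is usually obtained by rounding the division up, and the idx < N test in the kernel then discards the surplus threads. A sketch of that arithmetic, with a hypothetical helper name:

```shell
# grid_blocks N THREADS_PER_BLOCK
# Prints the smallest number of blocks whose threads cover N elements,
# i.e. N divided by THREADS_PER_BLOCK, rounded up.
grid_blocks() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

grid_blocks 12 4    # the example above: prints 3
grid_blocks 13 4    # one extra element needs a fourth block: prints 4
```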

 

How to compile the source code and submit a batch job:

module load cudatoolkit
nvcc -c mpicuda.cu
cc -DDEVS_PER_NODE=1 -o mpicuda.x mpicuda_main.c mpicuda.o

Write a SLURM batch script following the example provided above and submit it with sbatch mpicuda.sbatch.

The output should look like the following: 

On which nodes it executes
nid000[62-63]
Now run the MPI tasks...
MPI Rank 0 on nid00062 using Device 0 (Tesla)
MPI Rank 1 on nid00063 using Device 0 (Tesla)
0 101 202 303 404 505 606 707 808 909 1010 1111 
0 101 202 303 404 505 606 707 808 909 1010 1111 
Application 15276 resources: utime ~0s, stime ~2s
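
The numeric lines in this output can be checked by hand: the kernel adds a1[i] = i and a2[i] = 100*i, so element i of the result is 101*i. The sketch below reproduces one such line; the helper name expected_line is hypothetical.

```shell
# expected_line SIZE
# Prints the values i + 100*i for i = 0 .. SIZE-1, each followed by a
# space, which is the format used by run_gpu_kernel().
expected_line() {
    i=0
    out=""
    while [ "$i" -lt "$1" ]; do
        out="$out$(( i + 100 * i )) "
        i=$(( i + 1 ))
    done
    echo "$out"
}

expected_line 12
# prints: 0 101 202 303 404 505 606 707 808 909 1010 1111
```

Each MPI rank prints this line once, which is why it appears twice in a run with two tasks.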


To show job status: scontrol show jobs
To cancel a job: scancel <JOBID>
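
The job ID needed by these commands is reported by sbatch when the job is submitted. A minimal sketch for capturing it, assuming sbatch prints its usual "Submitted batch job <JOBID>" line (the helper name jobid_from_sbatch is hypothetical):

```shell
# jobid_from_sbatch
# Reads the output of sbatch on standard input and prints the job ID,
# i.e. the fourth word of the "Submitted batch job <JOBID>" line.
jobid_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Demonstration with a sample line instead of a real submission
echo "Submitted batch job 15276" | jobid_from_sbatch   # prints 15276

# On the system this could be combined as follows:
#   jobid=$(sbatch mpicuda.sbatch | jobid_from_sbatch)
#   scontrol show job "$jobid"
#   scancel "$jobid"
```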

For a list of the most useful SLURM commands, please have a look at the corresponding FAQ section under the User Forum. If you have any questions, please contact us at help@cscs.ch.