GPU Programming Course, CSCS Manno, 29-31 March 2011

Days 1-2

Michael Wolfe, PGI

Part I. CPU Architecture vs. GPU Architecture

  • CPU architecture basics

  • Multicore and multiprocessor basics

  • GPU architecture basics

  • How is the GPU connected to the host?

  • Why is parallel programming for GPUs different from programming for multicore?

  • What is a GPU thread and how does it execute?

  • How can I identify my GPU?

Part II. CUDA C and Fortran

1. Low-level GPU Programming and CUDA

  • How does a program run on the GPU?

  • What kinds of parallelism are appropriate for a GPU?

  • The CUDA programming model

  • Host code to control GPU, allocate memory, launch kernels

  • Kernel code to execute on GPU

    • Scalar routine executed on one thread
    • Launched in parallel on a grid of thread blocks

2. The Host Program

  • Declaring and allocating device memory data

  • Moving data to and from the device

  • Launching kernels


SHORT LAB

3. Writing Kernels

  • What is allowed in a kernel vs. what is not allowed

  • Grids, Blocks, Threads, Warps


4. Building and Running CUDA Programs

  • Compiler options

  • Running your program

  • The CUDA Runtime API

  • CUDA Fortran vs. CUDA C


LAB

5. Performance Tuning, Tips and Tricks

  • Measuring performance, using cudaprof

  • Optimizing your kernels

  • Optimize communication between host and GPU

  • Optimize device memory accesses, shared memory usage

  • Optimize the kernel code

    • loop unrolling
    • thread block unrolling
    • grid unrolling
    • pipelining

  • Debugging


PERFORMANCE LAB

Part III. PGI Accelerator Model

1. High-level GPU Programming using the PGI Accelerator Model

  • What role does a high-level model play?

  • Basic concepts and directive syntax

  • Accelerator compute and data regions

  • Appropriate algorithms for a GPU


2. Building and Running Accelerator Programs

  • Command line options

  • Enabling compiler feedback


SHORT LAB

3. Accelerator Directive Details

  • Compute regions

  • Clauses on the compute region directive

  • What can appear in a compute region

  • Obstacles to successful acceleration

  • Loop directive

  • Clauses on the loop directive

  • Loop schedules

  • Data regions

  • Clauses on the data region directive


LAB

4. Interpreting Compiler Feedback

  • Using pgprof source browser

  • Hindrances to parallelism

  • Data movement feedback

  • Reading kernel schedules


LAB

5. Performance Tuning, Tips and Tricks

  • Appropriate algorithm

  • Optimizing data movement between host and GPU

  • Data regions, mirrored / reflected data, CUDA data

  • Optimizing kernel performance

  • Tuning the kernel schedule

    • unroll clauses

  • Choosing accelerator device

  • PGI Unified Binary

  • Performance profiling information

  • GPU initialization time on Linux


PERFORMANCE LAB

Part IV. Wrap-up, Questions

1. Accelerators in HPC

  • Past, present, future role of accelerators in HPC

  • Past, present, future of programming models for accelerators

Day 3

Sadaf Alam, Jeff Poznanovic, and Tim Robinson, CSCS

Molecular Dynamics Codes on GPGPUs

  • Introduction to GPGPU technologies for scientific computing

  • Overview of parallel classical molecular dynamics software

  • Evolution of GPU acceleration for classical molecular dynamics software

  • Walkthrough using GPU-accelerated NAMD / pmemd (Rosa vs. Eiger)

  • Demo with case studies

  • Tips and tricks for optimal usage of GPU-accelerated simulations

  • Advanced topics and future outlook


LAB SESSION