Earlier this year, CSCS hosted a tutorial called “Molecular Dynamics Codes on GPGPUs”. The tutorial provided an overview of classical molecular dynamics software, presented benchmarking results, and gave a hands-on walkthrough of GPU-accelerated MD codes.
Here, we’d like to provide a brief summary of the benchmarking results featured in the tutorial. We looked at two of the most popular MD codes at CSCS, NAMD (v2.7) and AMBER (v11), both of which already have CUDA implementations. We were interested in whether a GPU-based system would decrease the time to solution, and whether the implementations were mature enough for production science.
For the AMBER benchmarks, we compared the following two systems:

- Rosa: Cray XT5, each node contains 2x 2.4 GHz AMD Istanbul (6-core) CPUs, 4-12 MPI processes per node
- Fuji testbed: InfiniBand cluster, each node contains 1x Nvidia Tesla C2050 GPU and 2x 2.93 GHz Intel Xeon X5670 (6-core) CPUs, 1 MPI process per node
The implicit solvent benchmarks showed spectacular speed-ups on Fuji over Rosa for the larger problem sizes: 4 Fuji nodes ran the Nucleosome benchmark about 2.6x faster than 48 Rosa nodes. On smaller benchmarks such as Trp-cage, the performance improvement was significantly smaller.
The explicit solvent benchmarks showed more modest improvements on the GPU-enabled Fuji system: for the larger problem sizes, 4 Fuji nodes matched the performance of approximately 10-12 Rosa nodes.
For the NAMD benchmarks, we compared the following two systems:

- Rosa: Cray XT5, each node contains 2x 2.4 GHz AMD Istanbul (6-core) CPUs
- Eiger: InfiniBand cluster, each node contains 2x Nvidia Fermi GPUs (Tesla C2070 or M2050) and 2x AMD Istanbul (or Magny-Cours) CPUs, 12 MPI processes per node
For NAMD, the largest molecular systems tended to perform best on the GPUs: 6 Eiger nodes produced performance comparable to 24 Rosa nodes on the STMV dataset. We also tested NAMD’s ability to let multiple processes share only a few GPUs via pipelining, and verified that this “GPU sharing” strategy does in fact give faster results.
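To make the pipelining idea concrete, here is a minimal CUDA sketch. It is our own illustration, not NAMD source: the `busy_kernel` workload and the stream count are arbitrary choices, with streams standing in for the MPI processes. Several independent work units share one GPU, so that one unit’s host-device transfer can overlap another unit’s kernel execution:

```cuda
// Illustrative only: several independent streams share one GPU, so one
// stream's copy can overlap another stream's compute (the pipelining idea).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 200; ++k)      // artificial compute load
            d[i] = d[i] * 1.000001f + 0.5f;
}

int main() {
    const int n = 1 << 20, nStreams = 4;   // 4 "processes" sharing the GPU
    const size_t bytes = n * sizeof(float);
    cudaStream_t s[nStreams];
    float *h[nStreams], *d[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&s[i]);
        cudaMallocHost((void**)&h[i], bytes);  // pinned memory enables async copies
        cudaMalloc((void**)&d[i], bytes);
    }
    for (int i = 0; i < nStreams; ++i) {       // copies and kernels interleave
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        busy_kernel<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) {
        cudaFreeHost(h[i]); cudaFree(d[i]); cudaStreamDestroy(s[i]);
    }
    printf("done\n");
    return 0;
}
```

In a profiler timeline, the four streams’ transfers and kernels interleave rather than run back-to-back, which is exactly the effect that makes sharing a GPU among multiple processes pay off.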
It is possible to get impressive speed-ups with GPUs for AMBER and NAMD, but the gains depend heavily on the specific molecular system of interest. In general, the *largest* problem sizes perform best on GPUs: a high number of atoms per GPU is required for efficiency, or else the speedup will be outweighed by the cost of the data transfers. It is also important to tune the application configuration so that the GPUs are not continually transferring data instead of performing useful work.
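The transfer-versus-compute trade-off is easy to measure directly. The following CUDA sketch (again our own illustration, not AMBER or NAMD code; the `saxpy` kernel is just a stand-in for real force kernels) uses CUDA events to time the host-device transfer and the kernel separately. Running it with a small and then a large `n` shows how transfer time dominates small problems:

```cuda
// Times host-device transfer vs. kernel execution with CUDA events.
// Small n: transfer-dominated. Large n: compute starts to pay for the copies.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];     // stand-in for real force kernels
}

static float elapsed_ms(cudaEvent_t a, cudaEvent_t b) {
    float ms; cudaEventElapsedTime(&ms, a, b); return ms;
}

int main() {
    const int n = 1 << 16;                 // try 1<<16 vs. 1<<24 and compare
    const size_t bytes = n * sizeof(float);
    float *hx, *hy, *dx, *dy;
    cudaMallocHost((void**)&hx, bytes); cudaMallocHost((void**)&hy, bytes);
    cudaMalloc((void**)&dx, bytes);     cudaMalloc((void**)&dy, bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    printf("transfer: %.3f ms  kernel: %.3f ms\n",
           elapsed_ms(t0, t1), elapsed_ms(t1, t2));
    return 0;
}
```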
If you are interested in using GPU-enabled molecular dynamics codes on CSCS’ systems, or have questions about developing hybrid accelerator codes or running multi-node GPU applications, please let us know.