During a recent CSCS training event on accelerator programming, one of the participants raised a very reasonable question: Why must we continue to rewrite our codes (using multiple parallel languages) for increasingly complex supercomputers?
At CSCS, we understand this frustration — it takes significant effort to learn new programming models and hardware architectures. Looking forward, however, the hardware industry’s drive toward increasingly complex architectures appears to be an unrelenting trend. We expect that computational scientists will need to continue developing codes that exploit multiple levels of parallelism — distributed memory, shared memory, and SIMD — in order to make the best use of future systems.
As an example, HPCWire recently reported on an HP Labs research project that aims to produce a 10-teraflops chipset. To improve memory bandwidth, the chip will use 3D chip stacking and an optical memory module. The chip, code-named Corona, is expected to have 256 cores, each supporting four simultaneous threads. Additionally, since commodity CPU components are expected to be used, a further level of SIMD parallelism would likely exist (e.g. a wider follow-on to AVX).
Such a high level of parallelism on a single chip will present significant challenges for attainable intra-node application performance. To close this gap between theoretical and sustained application performance, we hope that HP Labs will invest a comparable effort in designing a programming model that helps developers code easily and efficiently for such complex hardware.
At CSCS, we will continue to investigate upcoming technologies (e.g. processors, interconnects, I/O systems, programming models, and libraries) and to provide hardware and training for those that demonstrate high performance and improved programmability.