(Click on title slide to access entire slide deck.)
One of the most important developments over the last decade in HPC has been the emergence of GPUs as a (more or less) general purpose computing platform.
In this lecture we look at the basic structure of GPU hardware and how that imposes a particular (SIMD) programming model. We give an overview of NVIDIA’s programming environment for its GPUs: CUDA, as well as Thrust, a C++ library interface built on top of it.
There are a few things in this lecture that perhaps bear repeating.
Parallel programming begins with finding concurrency and partitioning the problem. The pieces of work from the partitioned problem are then given to workers to compute on. But each worker needs to know what their assigned work is. There are two ways to manage this.
1. The manager can specify to the worker what their part of the work is, which can either be
   a. Passed to the worker as a parameter
   b. Contained in the data to be worked on in some way
2. The worker can figure out what their part of the work is. This is a necessary approach when there isn’t a manager that explicitly farms out work. In several important classes of parallelism there isn’t a manager – GPU programming and distributed-memory programming, for example.
When there isn’t a manager handing out defined pieces of work, the worker needs two pieces of information in addition to the work it will be doing: how many total workers there are, and a unique ordinal within that number of workers. For example, if there are 10 total workers, each worker must be able to identify itself as being 0, 1, 2, … , 9.
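On a GPU, this self-identification is exactly what CUDA’s built-in index variables provide. The sketch below is a minimal, hypothetical illustration (the kernel name, array, and launch sizes are made up for this example, not taken from the slides): each thread computes its ordinal from its block and thread indices, and the total number of workers from the grid and block dimensions, then uses those two values to select its share of the data.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread (worker) figures out its own ordinal
// and the total number of workers, then uses the ordinal to pick its work.
__global__ void scale(float* x, int n, float alpha) {
    // Unique ordinal of this worker: which thread am I, over the whole grid?
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    // Total number of workers launched in the grid.
    int total = gridDim.x * blockDim.x;

    // Grid-stride loop: worker `tid` handles elements tid, tid+total, tid+2*total, ...
    for (int i = tid; i < n; i += total) {
        x[i] = alpha * x[i];
    }
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    // 128 blocks of 256 threads -> 32768 workers, each identified by its ordinal.
    scale<<<128, 256>>>(x, n, 2.0f);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);   // expect 2.0
    cudaFree(x);
    return 0;
}

Note that no manager ever tells a thread which elements are its responsibility; each thread derives its assignment entirely from its ordinal and the total worker count, which is the pattern the rest of the lecture builds on.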