University of Washington  
Department of Applied Mathematics  
AMATH 483/583   Assigned: June 01, 2019
Problem Set #7   Due: June 06, 2019


Introduction

As with previous assignments, this assignment extracts into a directory ps7. In that directory is a file Questions.md with short-answer questions that you will be asked in different parts of this assignment. To fill in your answers, open the file in vscode and insert your answers directly into the file, using the format indicated.

The tarball for this assignment can be downloaded with this link.

Preliminaries

The problems for this assignment will be done on an AWS GPU-enabled instance. In addition to the GPU, the instance we are using is fairly substantial (4 CPU cores) and should be able to manage multiple students running on it at once. If it turns out we need more instances, I can spin up a few more later in the week.

You can connect to the GPU instance(s) in the same way as the CPU instance from PS6. Currently, the IP address for the instance is 34.216.48.13. It also has DNS name ec2-34-216-48-13.us-west-2.compute.amazonaws.com. I suggest putting this address into your .ssh/config file with a name such as “gpu”. The IP address of the instance we are using may change and if we add instances, the new instances will have different IP addresses. I will keep a current list of instance IP addresses on Piazza and on Canvas.

Finding out about your GPU

Once you are logged on to the instance, you can query the GPU using the command

nvidia-smi

It should print out information similar to the following:

% nvidia-smi
Fri May 31 07:14:35 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   58C    P0    73W / 149W |      0MiB / 11441MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This indicates the GPU is a Tesla K80 of the Kepler family with about 11 GiB of memory (11441 MiB), and that we have version 418.67 of the drivers and version 10.1 of CUDA installed.

The Nvidia CUDA toolkit contains a large number of substantive examples. With the latest version of the toolkit, everything related to CUDA is installed in /usr/local/cuda; the samples can be found in /usr/local/cuda/samples. I encourage you to look through some of them – the 6_Advanced subdirectory in particular has some interesting and non-trivial examples.

Project Organization

For this assignment, your code repo is organized into separate subdirectories, one for each “project”. For a particular project, there are files that are specific to that project (the “main” file) and other files that are shared among projects.

Since one of the foundational rules of programming is to never keep duplicates of code, or files, or anything else, when you don't have to, we put all of the common files into a single subdirectory and refer to them there from each project. So if you look in the include subdirectory, you will see the files for Matrix, Vector, etc.: basically, all of the files we #include in other files (hence the name include for the subdirectory where they are kept).

A less obvious piece of common code has to do with the Makefile. Much of what we have put into our Makefiles over the quarter has been boilerplate: implicit rules for generating a .exe from a .o, compiler flags, and so forth. Each project subdirectory needs its own Makefile to control how the files are built there – but, again, we don’t want to keep around duplicates of the boilerplate. Hence, that is all kept in the include subdirectory as well in the file Makefile.inc. Makefile.inc is included into each project-specific Makefile with the statement

include ../include/Makefile.inc

I also added some tests for setting up flags in Makefile.inc so that (hopefully) the “right thing” will happen when it is used on different platforms.

The bits in each Makefile that you should be concerned with are the macros

OMP     := no
THREADS := no
PYTHON  := no
CIMG    := no

Setting any of these to “yes” will cause the appropriate flags to be added to the compilation and linking steps in the build process.

Templates

In a previous assignment we compared the matrix-matrix product for row-oriented matrices and column-oriented matrices (RowMatrix and ColMatrix classes, respectively). One thing you may have noticed about the matrix multiply algorithms for each class was that they were very similar.

  void multiply(const RowMatrix& A, const RowMatrix& B, RowMatrix& C) {
    for (size_t i = 0; i < A.num_rows(); ++i) {
      for (size_t j = 0; j < B.num_cols(); ++j) {
        for (size_t k = 0; k < A.num_cols(); ++k) {
          C(i, j) += A(i, k) * B(k, j);
        }
      }
    }
  }
  
  void multiply(const ColMatrix& A, const ColMatrix& B, ColMatrix& C) {
    for (size_t i = 0; i < A.num_rows(); ++i) {
      for (size_t j = 0; j < B.num_cols(); ++j) {
        for (size_t k = 0; k < A.num_cols(); ++k) {
          C(i, j) += A(i, k) * B(k, j);
        }
      }
    }
  }

So similar, in fact, that the only difference is the type of the arguments to the functions. The function bodies are identical. Moreover, they are invoked identically:

  RowMatrix A(size), B(size), C(size);
  ColMatrix D(size), E(size), F(size);
  multiply(A, B, C);
  multiply(D, E, F);  

Notice that RowMatrix and ColMatrix have the same interface – and deliberately so. A matrix has particular semantics and both RowMatrix and ColMatrix should present those. The difference between the two types has nothing to do with how they are used – the difference is simply how they internally represent the matrix.

However, despite having the same interfaces and same semantics, since RowMatrix and ColMatrix are different types, we have to have different multiply functions for them. But it would seem rather a waste of effort to have two functions with exactly the same text that differ only by the types of the arguments. Moreover, for more complicated functions, or for functions that need to support more than one type, this quickly becomes a maintenance nightmare waiting to happen – when one function is changed, all versions need to be changed. It is much easier to maintain just a single function. And, finally, what happens if we have a new matrix type with the same interface? If we want a multiply routine for it, we have to write a new one – copying and pasting the body of another one and changing the types of the parameters. Ugh.

This problem is similar to the one we saw with procedural abstraction. That is, when we have the same program text except for differing values, we parameterize the function with those values and use a single function over multiple values. To deal with this issue at a higher level, C++ enables parameterization over types with the template mechanism. Using templates, we can write a type-parameterized multiply as follows:

  template <typename MatrixType>
  void multiply(const MatrixType& A, const MatrixType& B, MatrixType& C) {
    for (size_t i = 0; i < A.num_rows(); ++i) {
      for (size_t j = 0; j < B.num_cols(); ++j) {
        for (size_t k = 0; k < A.num_cols(); ++k) {
          C(i, j) += A(i, k) * B(k, j);
        }
      }
    }
  }

When a function is parameterized this way, it is referred to as a “function template” (not a “template function”). It is parameterized on the typename MatrixType.

We can invoke the multiply function template in the same way as before:

  RowMatrix A(size), B(size), C(size);
  ColMatrix D(size), E(size), F(size);
  multiply(A, B, C);
  multiply(D, E, F);  

And, better yet, we can call the multiply function template on any type we like, as long as it provides the interface expected in the body of the function template.
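
For instance, a hypothetical matrix class (MyMatrix below is purely illustrative and not part of the course repo) can be composed with the same function template, as long as it provides num_rows(), num_cols(), and operator()(i, j):

  #include <cstddef>
  #include <vector>

  // MyMatrix is a hypothetical type used only for illustration.
  class MyMatrix {
  public:
    MyMatrix(size_t rows, size_t cols)
        : rows_(rows), cols_(cols), storage_(rows * cols, 0.0) {}
    size_t num_rows() const { return rows_; }
    size_t num_cols() const { return cols_; }
    double&       operator()(size_t i, size_t j)       { return storage_[i * cols_ + j]; }
    const double& operator()(size_t i, size_t j) const { return storage_[i * cols_ + j]; }
  private:
    size_t rows_, cols_;
    std::vector<double> storage_;
  };

  // MyMatrix G(128, 128), H(128, 128), K(128, 128);
  // multiply(G, H, K);   // instantiates multiply with MatrixType = MyMatrix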

Templates are an extraordinarily powerful mechanism in C++. In fact, the template mechanism in C++ is its own Turing-complete language – one can perform arbitrary compile-time operations using it. (C++ has more recently introduced the constexpr mechanism for compile-time computation, which should generally be used instead of template metaprogramming.)
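
For example, a small compile-time computation that once called for template metaprogramming can now be written as an ordinary constexpr function (a minimal illustration, not from the course repo):

  #include <cstddef>

  // The compiler evaluates this entirely at compile time when the
  // argument is a constant expression.
  constexpr size_t factorial(size_t n) {
    return (n <= 1) ? 1 : n * factorial(n - 1);
  }
  static_assert(factorial(5) == 120, "computed by the compiler, not at run time");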

There are a few things to be aware of when using function templates.

  1. Function templates are not functions. As the name implies, function templates are, well, templates. They are not actual functions and are not compiled – by themselves they don’t have enough information to be compiled. The type parameter needs to be bound to an actual type – then the compiler has enough information to compile the function. One refers to the creation of a function when the type parameter is bound as “instantiation.”
  2. The model for using function templates is different than for concrete functions. To use a concrete function in a program we need its declaration. The function itself can exist elsewhere and be compiled separately from where it is being called. We can't compile a function template separately – we don't even know what it will be until we compose it with a particular type. When the function template is composed with a particular type, then the compiler can create an actual function based on this composition (an instantiation) and compile it. Accordingly:
  3. Function templates are included completely in header files. The full text of the function template needs to be present when it is instantiated, so the full text of the function template is included as its declaration and definition in the header file.
  4. Errors in function templates only become exposed upon composition. Since the function is compiled only when it is composed with a particular type, we won’t see any errors in the function (or with the composition) until the composition occurs and the compiler attempts to compile the composed function. Although error messages are getting better (and C++ will be including methods for better checking of templates, i.e., concepts), error messages from errors in function templates are notoriously verbose.
  5. Instantiated function templates perform exactly as the equivalent concrete function. That is, the function template multiply will give exactly the same performance when composed with RowMatrix or ColMatrix as the earlier functions specifically written for those types.

Template Specialization

“Aha, not so fast!” you might be saying. “We don’t necessarily want to use the same function for RowMatrix and ColMatrix. Different loop orderings worked better for each case.”

This is true. And the C++ template mechanism provides a feature just for handling this issue – that is, the issue of needing selectively different functionality to get better performance (e.g.) for a given type.

The file specialization.cpp in the warmup subdirectory shows how this is done. The generic function template is this

template <typename MatrixType>
void multiply(const MatrixType& A, const MatrixType& B, MatrixType& C) {
  std::cout << "Generic multiply" << std::endl;
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < B.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}

(Note we’ve added a diagnostic print statement just for this example.)

The specialization of multiply for ColMatrix is this

template <>
void multiply(const ColMatrix& A, const ColMatrix& B, ColMatrix& C) {
  std::cout << "specialized ColMatrix multiply" << std::endl;
  for (size_t j = 0; j < B.num_cols(); ++j) {
    for (size_t k = 0; k < A.num_cols(); ++k) {
      for (size_t i = 0; i < A.num_rows(); ++i) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}

If we compile and run this program

int main() {

  size_t N = 256;
  Matrix A(N, N), B(N, N), C(N, N);
  RowMatrix rA(N, N), rB(N, N), rC(N, N);
  ColMatrix cA(N, N), cB(N, N), cC(N, N);  // Column matrix
  multiply(A, B, C);
  multiply(rA, rB, rC);
  multiply(cA, cB, cC);                    // Column matrix

  return 0;
}

we get

$ ./specialization.exe
Generic multiply
Generic multiply
specialized ColMatrix multiply

When deciding how to instantiate a function template, the compiler examines the types of the arguments being passed in and attempts to match them against any specializations; if none are found, it uses the base template. The types associated with the arguments are substituted for the template parameters – the types are deduced by the compiler.

Sometimes the compiler cannot deduce the type we intend to use. In that case we can invoke the function template specifically with a type, for example

multiply<ColMatrix>(cA, cB, cC);

will explicitly bind ColMatrix to the template parameter MatrixType and then look up the corresponding function. This example isn’t horribly interesting since the type can be deduced from the argument. But again, there may be cases when that isn’t possible and you will need to specify the type you want to instantiate your template with.
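
A typical case where deduction fails is a template parameter that appears only in the return type. The following make_identity is a hypothetical example (not part of the course repo); the compiler has no argument from which to deduce MatrixType, so it must be supplied explicitly:

  template <typename MatrixType>
  MatrixType make_identity(size_t n) {
    MatrixType I(n, n);                 // assumes a (rows, cols) constructor
    for (size_t i = 0; i < n; ++i) {
      I(i, i) = 1.0;
    }
    return I;
  }

  // auto I = make_identity<ColMatrix>(1024);  // explicit template argument required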

Finally, if you look in your repo you will notice we no longer have the various .cpp files that we had in the past. All of those functions have been converted to function templates and are now completely included in header files. If you browse through some of those files you will see how this is done. Notice that a lot of redundancy has been eliminated.

Warm Up

Hello CUDA

The hello subdirectory of your repo contains a set of basic examples: the “hello_cuda” programs that were presented in lecture. The executables can be compiled by issuing “make”. Each program takes two arguments – problem size and number of trips to run in its timing experiments. NB: the size argument in these examples is the $\text{log}_2$ of the size. That is, if you pass in a value of, say, 20, the problem size that is run is $2^{20}$.

The hello subdirectory also contains a script script.bash which runs all of the hello executables over a range of problem sizes and plots the results. In addition to the CUDA examples, there is a sequential implementation, an OpenMP implementation, and a Thrust implementation. There wasn't a good way to put all of these examples in a single file and create a plot from that – these examples print out ASCII files and a separate Python script creates the plot. Everything to do this should be installed on the AWS instance. (Unless you have a laptop with an Nvidia graphics card AND you have CUDA installed just so, it won't be possible to do this on your laptop.)

Run the file script.bash and examine the output plot. Review the slides from lecture and make sure you understand why the versions 1, 2, and 3 of hello_cuda give the results that they do. Note also the performance of the sequential, OpenMP, and Thrust cases.

How many more threads are run in version 2 compared to version 1? How much speedup might you expect? How much speedup do you see in your plot?

nvprof

Nvidia has a wonderful tool for analyzing CUDA applications: the Nvidia Visual Profiler (nvvp). Unfortunately it runs too slowly over a forwarded X connection and isn't really usable remotely. However, there is a command-line program (nvprof) that provides fairly good diagnostics, but you have to know what to look for.

You can see a list of the events and metrics that nvprof can provide by executing:

$ nvprof --query-events 
$ nvprof --query-metrics

Complete information about cuda programming (including a best practices guide) can be found at https://docs.nvidia.com/cuda/.

Your hello subdirectory also contains a script nvprof.bash that invokes nvprof with a few metrics of interest.

Run nvprof.bash on the three hello_cuda executables.

$ bash nvprof.bash ./hello_cuda_1.exe
$ bash nvprof.bash ./hello_cuda_2.exe
$ bash nvprof.bash ./hello_cuda_3.exe

Quantities that you might examine in order to tune performance are “achieved occupancy” (how efficiently is your program using the GPU cores) and memory throughput. Compare the achieved occupancy for the three hello_cuda programs above. The occupancy should correlate well (though not perfectly) with the performance that the three programs exhibit.

Reduction

The reduction example that we looked at in class is in the subdirectory reduction. The files are copied from the CUDA set of samples and slightly modified so that the program would compile in the course repo. There are seven different kernels that can be run from the program; kernels are selected by passing in the --kernel=<#> option, where <#> can be from 0 to 6. The kernels are progressively more optimized, as we showed in lecture. The comments in the code describe the optimization applied for each successive kernel.

Compile and run the reduction program and use nvprof to see whether the different optimizations (as described in lecture and in the Nvidia documentation) actually have the effect that is claimed.

$ nvprof ./reduction --kernel=0
$ nvprof ./reduction --kernel=1
$ ...

Documentation on the optimization process can also be found at:

https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

Problems

Norm

The subdirectory norm contains skeletons for several GPU-based algorithms for computing the two-norm of a vector: cuda_norm, thrust_norm, and lambda_norm. The latter two use Thrust for implementing and managing kernels, whereas cuda_norm is plain CUDA.

Documentation for thrust can be found at

https://docs.nvidia.com/cuda/thrust/index.html

You may find it helpful to refer to the hello examples for the rest of this problem.

Complete the implementation in thrust_norm.

This should be fairly straightforward if you use the thrust function transform_reduce() and pass it the binary and unary operators that are defined. You will also need to fill in the contents of what will be the kernel (the function each thread will run for its part of the overall computation). In this case you are creating a function object (discussed in PS5).

To finish the function object, your task is to complete the member function with prototype T operator()(const T&) – that is, the function call operator of the class template square.

Recall that for a function object, when we declare an object of type square<float>, we are instantiating a function object that takes a float and returns a float. We can use that object just as we would a function or a lambda. In this example we declare unary_op as follows

 square<float>        unary_op;

and pass it into transform_reduce as an argument:

thrust::transform_reduce(X.begin(), X.end(), unary_op, init, binary_op)

Note that we could also construct the function object in place in the call to transform_reduce:

thrust::transform_reduce(X.begin(), X.end(), square<float>(), init, binary_op)

You may have already noticed this technique in some of the uses we have made of, e.g., std::sort.
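
Putting the pieces together, here is a minimal sketch of a square function object and a transform_reduce call that uses it. The function name thrust_two_norm and the use of thrust::device_vector are assumptions made to keep the example self-contained; the actual skeleton in thrust_norm may differ in its types and names.

  #include <cmath>
  #include <thrust/device_vector.h>
  #include <thrust/functional.h>
  #include <thrust/transform_reduce.h>

  // square<T>: a function object whose call operator squares its argument.
  // __host__ __device__ lets thrust invoke it from GPU code.
  template <typename T>
  struct square {
    __host__ __device__
    T operator()(const T& x) const { return x * x; }
  };

  float thrust_two_norm(const thrust::device_vector<float>& X) {
    square<float>       unary_op;    // transform: x -> x*x
    thrust::plus<float> binary_op;   // reduce: sum
    float               init = 0.0f;
    float sum_of_squares =
        thrust::transform_reduce(X.begin(), X.end(), unary_op, init, binary_op);
    return std::sqrt(sum_of_squares);
  }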

Complete the implementation in lambda_norm. This is similar to thrust_norm, except we pass in a lambda rather than a function object.
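
A sketch under the same assumptions as above (the function name lambda_two_norm and the device_vector are again ours, not the skeleton's; device lambdas also require nvcc's --extended-lambda flag, which the course build may already enable):

  #include <cmath>
  #include <thrust/device_vector.h>
  #include <thrust/functional.h>
  #include <thrust/transform_reduce.h>

  float lambda_two_norm(const thrust::device_vector<float>& X) {
    return std::sqrt(thrust::transform_reduce(
        X.begin(), X.end(),
        [] __device__ (float x) { return x * x; },   // unary op as a lambda
        0.0f,                                        // initial value
        thrust::plus<float>()));                     // binary (reduction) op
  }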

Complete the implementation in cuda_norm. To make this easier to accomplish, the skeleton is set up so that the kernel stores an intermediate result into another vector (y). The final norm can be computed by having the CPU accumulate over y rather than having the GPU do it.
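
A hypothetical sketch of such a kernel (the skeleton's actual kernel name, signature, and launch configuration may differ): each thread squares a subset of the elements of x into y using a grid-stride loop, and the host then sums y and takes the square root.

  // Each thread handles elements index, index + stride, index + 2*stride, ...
  __global__ void square_into(const float* x, float* y, int n) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) {
      y[i] = x[i] * x[i];
    }
  }

  // Host side (sketch): launch square_into<<<blocks, threads>>>(x, y, n),
  // copy y back to the CPU, accumulate its entries, and take the sqrt.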

Once you have implemented and tested your implementations, run the script script.bash in norm. The script will plot the cuda/thrust programs as well as the sequential and OpenMP implementations.

Use nvprof (you may use the nvprof.bash from the hello subdirectory) to analyze the three cuda/thrust programs. Is it obvious from the output and/or the program timer output which kernel needs the most tuning, and what are the opportunities for tuning? Describe your observations about how well cuda_norm, thrust_norm, and lambda_norm are using the GPU (occupancy, processor usage, memory bandwidth).

(583 Only) Implement a performance optimization in one of cuda_norm, thrust_norm, or lambda_norm. The reduction example shows several possibilities, though you may design your own. Show, using nvprof output if at all possible, what the benefit is of your optimization.

OCR

The subdirectory ocr contains a simple machine learning program for performing optical character recognition. The dataset used is from the 500 Lines project, where the design of the neural net used in this example is also discussed. Note that much of the discussion there is also about the client-server GUI for testing and training. Our system has a simpler GUI with no client-server component (though you need to compile with X Windows and run the program on your local machine to use it).

Submitting Your Work

Do a make clean in your ps7 directory and then create a tarball ps7-submit.tar.gz containing all of the files in that directory. From the directory containing ps7:

$ cd ps7
$ make clean
$ cd ..
$ tar zcvf ps7-submit.tar.gz --exclude "./ps7/.vscode" ./ps7

If you relied on any external references, list them in a file refs.txt and include that in your tarball as well (in the ps7 subdirectory).

