University of Washington
Department of Applied Mathematics
AMATH 483/583
Problem Set #7
Assigned: June 01, 2019
Due: June 06, 2019
As with previous assignments, this assignment extracts into a directory ps7. In that directory is a file Questions.md with short-answer questions that you will be asked in different parts of this assignment. To fill in your answers, open the file in vscode and insert your answers directly into the file, using the format indicated.
The tar ball for this assignment can be downloaded with this link.
The problems for this assignment will be done on an AWS GPU-enabled instance. In addition to the GPU, the instance we are using is fairly substantial (4 CPU cores) and should be able to manage multiple students running on it at once. If it turns out we need more instances, I can spin up a few more later in the week.
You can connect to the GPU instance(s) in the same way as the CPU instance from PS6. Currently, the IP address for the instance is 34.216.48.13. It also has the DNS name ec2-34-216-48-13.us-west-2.compute.amazonaws.com. I suggest putting this address into your .ssh/config file with a name such as “gpu”. The IP address of the instance we are using may change, and if we add instances, the new instances will have different IP addresses. I will keep a current list of instance IP addresses on Piazza and on Canvas.
Once you are logged on to the instance, you can query the GPU using the command
nvidia-smi
It should print out information similar to the following:
% nvidia-smi
Fri May 31 07:14:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 58C P0 73W / 149W | 0MiB / 11441MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This indicates the GPU is of the Kepler family (a Tesla K80), has about 11 GiB of memory (11441 MiB), and that we have version 418.67 of the drivers and version 10.1 of CUDA installed.
The Nvidia CUDA toolkit contains a large number of substantive examples. With the latest version of the toolkit, everything related to CUDA is installed in /usr/local/cuda; the samples can be found in /usr/local/cuda/samples. I encourage you to look through some of them – the 6_Advanced subdirectory in particular has some interesting and non-trivial examples.
For this assignment, your code repo is organized into separate subdirectories, one for each “project”. For a particular project, there are files that are specific to that project (the “main” file) and other files that are shared among projects.
Since one of the foundational rules of programming is to never keep duplicates of code, or files, or anything, when you don’t have to, we put all of the common files into a single subdirectory and refer to them there from each project. So if you look in the include subdirectory, you will see the files for Matrix, Vector, etc. – basically, all of the files we #include in other files (hence the name include for the subdirectory where they are kept).
A less obvious piece of common code has to do with the Makefile. Much of what we have put into our Makefiles over the quarter has been boilerplate: implicit rules for generating a .exe from a .o, compiler flags, and so forth. Each project subdirectory needs its own Makefile to control how the files are built there – but, again, we don’t want to keep around duplicates of the boilerplate. Hence, that is all kept in the include subdirectory as well, in the file Makefile.inc. Makefile.inc is included into each project-specific Makefile with the statement
include ../include/Makefile.inc
I also added some tests for setting up flags in Makefile.inc so that (hopefully) the “right thing” will happen when it is used on different platforms.
The bits in each Makefile that you should be concerned with are the macros
OMP := no
THREADS := no
PYTHON := no
CIMG := no
Setting any of these to “yes” will cause the appropriate flags to be added to the compilation and linking steps in the build process.
In a previous assignment we compared matrix-matrix product for row-oriented matrices and column-oriented matrices (the RowMatrix and ColMatrix classes, respectively). One thing you may have noticed about the matrix multiply algorithms for the two classes was that they were very similar.
void multiply(const RowMatrix& A, const RowMatrix& B, RowMatrix& C) {
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < B.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
void multiply(const ColMatrix& A, const ColMatrix& B, ColMatrix& C) {
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < B.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
So similar, in fact, that the only difference is the type of the arguments to the functions. The function bodies are identical. Moreover, they are invoked identically:
RowMatrix A(size), B(size), C(size);
ColMatrix D(size), E(size), F(size);
multiply(A, B, C);
multiply(D, E, F);
Notice that RowMatrix and ColMatrix have the same interface – and deliberately so. A matrix has particular semantics, and both RowMatrix and ColMatrix should present those. The difference between the two types has nothing to do with how they are used – the difference is simply how they internally represent the matrix.
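To make that concrete, here is a rough sketch (not the actual code in the repo) of the only place the two representations differ: how an (i, j) index pair maps into one contiguous block of storage.

// Hypothetical sketch of the layout difference only; the repo's RowMatrix
// and ColMatrix are not defined this way verbatim.
#include <cstddef>

// Row-major (RowMatrix-style): the elements of a row are adjacent in memory.
size_t row_major_index(size_t i, size_t j, size_t num_cols) {
  return i * num_cols + j;
}

// Column-major (ColMatrix-style): the elements of a column are adjacent in memory.
size_t col_major_index(size_t i, size_t j, size_t num_rows) {
  return i + j * num_rows;
}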
However, despite having the same interface and the same semantics, since RowMatrix and ColMatrix are different types, we have to have different multiply functions for them. But it would seem rather a waste of effort to have two functions with exactly the same text that differ only by the types of the arguments.
Moreover, for more complicated functions, or for functions that need to support more than one type, this quickly becomes a maintenance nightmare waiting to happen – when one function is changed, all versions need to be changed. It is much easier to maintain just a single function. And, finally, what happens if we have a new matrix type with the same interface? If we want a multiply routine for it, we have to write a new one – copying and pasting the body of another one and changing the types of the parameters. Ugh.
This problem is similar to the one we saw with procedural abstraction. That is, when we have the same program text except for differing values, we parameterize the function with those values and use a single function over multiple values. To deal with this issue at a higher level, C++ enables parameterization over types with the template mechanism. Using templates, we can write a type-parameterized multiply as follows:
template <typename MatrixType>
void multiply(const MatrixType& A, const MatrixType& B, MatrixType& C) {
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < B.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
When a function is parameterized this way, it is referred to as a “function template” (not a “template function”). It is parameterized on the typename MatrixType.
We can invoke the multiply function template in the same way as before:
RowMatrix A(size), B(size), C(size);
ColMatrix D(size), E(size), F(size);
multiply(A, B, C);
multiply(D, E, F);
And, better yet, we can call the multiply function template on any type we like, as long as it provides the interface expected in the body of the function template.
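For example, a brand-new matrix type works with the existing function template, with no new overload, as long as it provides num_rows(), num_cols(), and operator()(i, j). (MyMatrix below is a made-up illustration, not a type in the repo.)

// Hypothetical type, not part of the course repo.  It provides the interface
// the multiply function template expects and nothing more.
#include <cstddef>
#include <vector>

class MyMatrix {
public:
  MyMatrix(size_t M, size_t N) : num_rows_(M), num_cols_(N), storage_(M * N, 0.0) {}
  size_t num_rows() const { return num_rows_; }
  size_t num_cols() const { return num_cols_; }
  double&       operator()(size_t i, size_t j)       { return storage_[i * num_cols_ + j]; }
  const double& operator()(size_t i, size_t j) const { return storage_[i * num_cols_ + j]; }

private:
  size_t num_rows_, num_cols_;
  std::vector<double> storage_;
};

// With the multiply function template in scope:
//   MyMatrix A(16, 16), B(16, 16), C(16, 16);
//   multiply(A, B, C);   // MatrixType is deduced to be MyMatrix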
Templates are an extraordinarily powerful mechanism in C++. In fact, the template mechanism in C++ is its own Turing-complete language – one can perform arbitrary compile-time operations using it. (More recently, C++ has introduced the constexpr mechanism for compile-time computation, which should generally be used instead of template metaprogramming.)
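As a small aside (not something you need for this assignment), here is what compile-time computation with constexpr looks like; no template machinery is involved.

// Minimal constexpr sketch: the value is computed by the compiler, and the
// static_assert checks it at compile time.
constexpr unsigned long factorial(unsigned long n) {
  return (n <= 1) ? 1 : n * factorial(n - 1);
}

static_assert(factorial(5) == 120, "evaluated entirely at compile time");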
There are a few things to be aware of when using function templates.
multiply will give exactly the same performance when composed with RowMatrix or ColMatrix as the earlier functions specifically written for those types.

“Aha, not so fast!” you might be saying. “We don’t necessarily want to use the same function for RowMatrix and ColMatrix. Different loop orderings worked better for each case.”
This is true. And the C++ template mechanism provides a feature for handling exactly this issue – the need for selectively different functionality (to get better performance, for example) for a given type.
The file specialization.cpp in the warmup subdirectory shows how this is done. The generic function template is this:
template <typename MatrixType>
void multiply(const MatrixType& A, const MatrixType& B, MatrixType& C) {
  std::cout << "Generic multiply" << std::endl;
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < B.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
(Note we’ve added a diagnostic print statement just for this example.)
The specialization of multiply for ColMatrix is this:
template <>
void multiply(const ColMatrix& A, const ColMatrix& B, ColMatrix& C) {
  std::cout << "specialized ColMatrix multiply" << std::endl;
  for (size_t j = 0; j < B.num_cols(); ++j) {
    for (size_t k = 0; k < A.num_cols(); ++k) {
      for (size_t i = 0; i < A.num_rows(); ++i) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
If we compile and run this program
int main() {
  size_t N = 256;

  Matrix    A(N, N),  B(N, N),  C(N, N);
  RowMatrix rA(N, N), rB(N, N), rC(N, N);
  ColMatrix cA(N, N), cB(N, N), cC(N, N);  // Column matrix

  multiply(A, B, C);
  multiply(rA, rB, rC);
  multiply(cA, cB, cC);  // Column matrix

  return 0;
}
we get
$ ./specialization.exe
Generic multiply
Generic multiply
specialized ColMatrix multiply
When deciding how to instantiate a function template, the compiler examines the types of the arguments being passed in and attempts to match them against any specializations; if none match, it uses the base template. The types associated with the arguments are substituted for the template parameters – that is, the types are deduced by the compiler.
Sometimes the compiler cannot deduce the type we intend to use. In that case we can invoke the function template specifically with a type, for example
multiply<ColMatrix>(cA, cB, cC);
will explicitly bind ColMatrix to the template parameter MatrixType and then look up the corresponding function. This example isn’t terribly interesting, since the type can be deduced from the argument. But there are cases where deduction isn’t possible, and then you will need to specify the type you want to instantiate your template with.
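The classic situation where deduction fails is when the template parameter appears only in the return type. The helper below is a made-up illustration (it is not in the repo), but it shows why the explicit <ColMatrix> syntax is sometimes necessary.

// Hypothetical helper, not part of the course repo.  MatrixType appears only
// in the return type, so the compiler has no argument from which to deduce it.
#include <cstddef>

template <typename MatrixType>
MatrixType make_identity(size_t N) {
  MatrixType I(N, N);
  for (size_t i = 0; i < N; ++i) {
    I(i, i) = 1.0;
  }
  return I;
}

// The template argument must be given explicitly:
//   ColMatrix I = make_identity<ColMatrix>(4);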
Finally, if you look in your repo you will notice we no longer have the various .cpp files that we had in the past. All of those functions have been converted to function templates and are now contained entirely in header files. If you browse through some of those files you will see how this is done. Notice that a lot of redundancy has been eliminated.
The hello subdirectory of your repo contains a set of basic examples: the “hello_cuda” programs that were presented in lecture. The executables can be compiled by issuing “make”. Each program takes two arguments – the problem size and the number of trips to run in its timing experiments. NB: the size argument in these examples is the $\log_2$ of the size. That is, if you pass in a value of, say, 20, the problem size that is run is $2^{20}$.
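In other words, the programs interpret the first argument roughly along these lines (a sketch of the idea, not the actual argument-parsing code in the hello_cuda programs):

#include <cstddef>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
  // Sketch only: an argument of 20 means a problem size of 2^20 elements.
  size_t log2_size    = (argc > 1) ? std::stoul(argv[1]) : 20;
  size_t problem_size = 1UL << log2_size;
  std::cout << "problem size: " << problem_size << std::endl;  // 1048576 for 20
  return 0;
}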
The hello subdirectory also contains a script script.bash, which runs all of the hello executables over a range of problem sizes and plots the results. In addition to the CUDA examples, there is a sequential implementation, an OpenMP implementation, and a Thrust implementation. There wasn’t a good way to put all of these examples in a single file and create a plot from that – the examples print out ASCII files, and a separate Python script creates the plot. Everything needed to do this should be installed on the AWS instance. (Unless you have a laptop with an Nvidia graphics card AND you have CUDA installed just so, it won’t be possible to do this on your laptop.)
Run the script script.bash and examine the output plot. Review the slides from lecture and make sure you understand why versions 1, 2, and 3 of hello_cuda give the results that they do. Note also the performance of the sequential, OpenMP, and Thrust cases.
How many more threads are run in version 2 compared to version 1? How much speedup might you expect? How much speedup do you see in your plot?
Nvidia has a wonderful tool for analyzing CUDA applications: the Nvidia visual profiler (nvvp). Unfortunately, it runs too slowly over a forwarded X connection and isn’t really usable remotely. However, there is a command-line program (nvprof) that provides fairly good diagnostics – but you have to know what to look for.
You can see a list of the events and metrics that nvprof can provide by executing:
$ nvprof --query-events
$ nvprof --query-metrics
Complete information about CUDA programming (including a best practices guide) can be found at https://docs.nvidia.com/cuda/.
Your hello subdirectory also contains a script nvprof.bash that invokes nvprof with a few metrics of interest.
Run nvprof.bash on the three hello_cuda executables:
$ bash nvprof.bash ./hello_cuda_1.exe
$ bash nvprof.bash ./hello_cuda_2.exe
$ bash nvprof.bash ./hello_cuda_3.exe
Quantities that you might examine in order to tune performance are “achieved occupancy” (how efficiently your program is using the GPU cores) and memory throughput. Compare the achieved occupancy for the three hello_cuda programs above. The occupancy should correlate well (though not perfectly) with the performance that the three programs exhibit.
The reduction example that we looked at in class is in the subdirectory reduction. The files are copied from the CUDA samples and slightly modified so that the program compiles in the course repo. There are seven different kernels that can be run from the program; kernels are selected by passing the --kernel=<#> option, where <#> can be from 0 to 6. The kernels are progressively more optimized, as we showed in lecture. The comments in the code describe the optimization applied for each successive kernel.
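As a reminder of the kind of change these kernels make, here is a simplified tree reduction in shared memory with sequential addressing, in the spirit of the intermediate kernels in that progression. It is a sketch, not a verbatim copy of any of kernels 0–6.

// Each block reduces its chunk of the input to a single partial sum.
// Sequential addressing avoids the divergent "tid % (2*s)" pattern of the
// naive kernel.  Launch with dynamic shared memory of blockDim.x * sizeof(float).
__global__ void reduce_sketch(const float* g_in, float* g_out, size_t n) {
  extern __shared__ float sdata[];

  size_t tid = threadIdx.x;
  size_t i   = blockIdx.x * blockDim.x + threadIdx.x;
  sdata[tid] = (i < n) ? g_in[i] : 0.0f;
  __syncthreads();

  // Halve the number of active threads at each step.
  for (size_t s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }

  if (tid == 0) {
    g_out[blockIdx.x] = sdata[0];  // one partial sum per block
  }
}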
Compile and run the reduction program and use nvprof to see whether the different optimizations (as described in lecture and in the Nvidia documentation) actually have the effect that is claimed.
$ nvprof ./reduction --kernel=0
$ nvprof ./reduction --kernel=1
$ ...
Documentation on the optimization process can also be found at:
https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
The subdirectory norm contains skeletons for several GPU-based algorithms for computing the two-norm of a vector: cuda_norm, thrust_norm, and lambda_norm. The latter two use Thrust for implementing and managing kernels, whereas cuda_norm is plain CUDA.
Documentation for thrust can be found at
https://docs.nvidia.com/cuda/thrust/index.html
You may find it helpful to refer to the hello examples for the rest of this problem.
Complete the implementation in thrust_norm.
This should be fairly straightforward if you use the thrust function transform_reduce() and pass it the binary and unary operators that are defined. You will also need to fill in the contents of what will be the kernel (the function each thread will run for its part of the overall computation). In this case you are creating a function object (discussed in PS5).
To finish the function object, your task is to complete the member function with prototype T operator()(const T&) – that is, the function call operator of the class template square.
Recall that for a function object, when we declare an object of type square<float>, we are instantiating a function object that takes a float and returns a float. We can use that object just as we would a function or a lambda. In this example we declare unary_op as follows
square<float> unary_op;
and pass it into transform_reduce as an argument:
thrust::transform_reduce(X.begin(), X.end(), unary_op, init, binary_op)
Note that we could also construct the function object in place in the call to transform_reduce:
thrust::transform_reduce(X.begin(), X.end(), square<float>(), init, binary_op)
You may have already noticed this technique in some of the uses we have made of, e.g., std::sort.
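To see the whole pattern in one self-contained place, here is a minimal sum-of-squares with transform_reduce. This is illustrative only – the assignment skeleton defines its own operators, data, and timing – and it assumes the standard Thrust headers that ship with the CUDA toolkit.

#include <cmath>
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>

// A square function object along the lines of the one in the skeleton.
template <typename T>
struct square {
  __host__ __device__ T operator()(const T& x) const { return x * x; }
};

int main() {
  thrust::device_vector<float> X(1024, 1.0f);

  square<float>       unary_op;
  thrust::plus<float> binary_op;
  float               init = 0.0f;

  float sum_of_squares =
      thrust::transform_reduce(X.begin(), X.end(), unary_op, init, binary_op);

  std::cout << "two-norm: " << std::sqrt(sum_of_squares) << std::endl;  // 32
  return 0;
}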
Complete the implementation in lambda_norm. This is similar to thrust_norm, except we pass in a lambda rather than a function object.
Complete the implementation in cuda_norm. To make this easier to accomplish, the skeleton is set up so that the kernel stores an intermediate result into another vector (y). The final norm can be computed by having the CPU accumulate over y rather than having the GPU do it.
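The shape of that computation is roughly the following (hypothetical names – the actual skeleton in cuda_norm has its own kernel signature, launch configuration, and memory management):

// Kernel: each thread squares one element and stores it into the
// intermediate vector y.
__global__ void square_elements(const float* x, float* y, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    y[i] = x[i] * x[i];
  }
}

// Host side, after copying y back from the device:
//   double sum = 0.0;
//   for (size_t i = 0; i < n; ++i) sum += y[i];
//   double norm = std::sqrt(sum);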
Once you have implemented and tested your implementations, run the script script.bash in norm. The script will plot the CUDA/Thrust programs as well as the sequential and OpenMP implementations.
Use nvprof (you may use nvprof.bash from the hello subdirectory) to analyze the three CUDA/Thrust programs. Is it obvious from the nvprof output and/or the program timer output which kernel needs the most tuning, and what the opportunities for tuning are? Describe your observations about how well cuda_norm, thrust_norm, and lambda_norm are using the GPU (occupancy, processor usage, memory bandwidth).
(583 Only) Implement a performance optimization in one of cuda_norm, thrust_norm, or lambda_norm. The reduction example shows several possibilities, though you may design your own. Show, using nvprof output if at all possible, what the benefit is of your optimization.
The subdirectory ocr contains a simple machine learning program for performing optical character recognition. The dataset used is from the 500 lines project, where the design of the neural net used in this example is also discussed. Note that much of the discussion there is about the client-server GUI for testing and training. Our system has a simpler GUI with no client-server component (though you need to compile with X windows and run the program on your local machine to use it).
Do a make clean in your ps7 directory and then create a tarball ps7-submit.tar.gz containing all of the files in that directory. Starting from the directory that contains ps7:
$ cd ps7
$ make clean
$ cd ..
$ tar zcvf ps7-submit.tar.gz --exclude "./ps7/.vscode" ./ps7
If you relied on any external references, list them in a file refs.txt and include that in your tarball as well (in the ps7 subdirectory).