University of Washington  
Department of Applied Mathematics  
AMATH 483/583   Assigned: April 25, 2019
Problem Set #4   Due: May 02, 2019

Introduction

As with previous assignments, this assignment extracts into a directory ps4. In that directory is a file Questions.md with short-answer questions that you will be asked in different parts of this assignment. To fill in your answers, open the file in vscode and insert your answers directly into the file, using the format indicated.

The tar ball for this assignment can be downloaded with this link.

Preliminaries

More make automation

Although we can use macros in a Makefile to help keep the content of our rules consistent, if you read through the Makefiles in the previous assignment you will quickly notice that there is still a lot of repetition in each Makefile.

One place with a lot of repetition is the rule for building a .o file from a .cpp file. All of the production rules that compile “something.cpp” to make “something.o” have the same pattern:

            $(CXX) $(CXXFLAGS) -c <something>.cpp -o <something>.o 

In large programs with large Makefiles, keeping these all consistent could benefit from further automation.

To automate pattern-based production rules, make has some “magic” macros that pattern match for you – namely $< and $@. The macro $< represents the file that is the dependency (the .cpp file) and $@ represents the target (the .o file).

A generic compilation statement would therefore be

        $(CXX) -c $(CXXFLAGS) $< -o $@

That is, given a dependency rule that has a dependency file and a target file (which is what all dependency rules have), this generic rule will substitute $@ with the target file and $< with the dependency.

We can use these in a Makefile in the following way:

CXX      = c++
CXXFLAGS = -Wall -g -std=c++11 

# Old way
amath583.o: amath583.cpp
        $(CXX) -c $(CXXFLAGS) amath583.cpp -o amath583.o

# New way
amath583.o: amath583.cpp
        $(CXX) -c $(CXXFLAGS) $< -o $@

Now, every target could have that same rule.

amath583.o: amath583.cpp amath583.hpp Vector.hpp
        $(CXX) -c $(CXXFLAGS) $< -o $@

julia.o: julia.cpp 
        $(CXX) -c $(CXXFLAGS) $< -o $@

norm.o: norm.cpp Vector.hpp amath583.hpp
        $(CXX) -c $(CXXFLAGS) $< -o $@

There are a few more things we need to clean up here, though. First, there is an issue with the $< macro: namely, which dependency is the one that we will compile to make the target? Most .o files depend on a large number of header files. If a recompile rule gets triggered because one of those changes, we want to compile the .cpp file that includes the .hpp file, not the .hpp file itself. If we give a rule a list of dependencies, make will pick the first file as the one to pass to $<, so as long as we put the .cpp file first in the list, it will be okay.

Another approach, and the one we will be using for reasons that will become clear, is to note that dependencies accumulate; you don’t need to specify all of them on one line. For instance, these are three equivalent ways of expressing the same dependencies:

# First way
norm.o: norm.cpp Vector.hpp amath583.hpp 
        $(CXX) -c $(CXXFLAGS) $< -o $@

# Second way 
norm.o: Vector.hpp amath583.hpp 
norm.o: norm.cpp 
        $(CXX) -c $(CXXFLAGS) $< -o $@

# Third way 
norm.o: norm.cpp 
        $(CXX) -c $(CXXFLAGS) $< -o $@

norm.o: Vector.hpp
norm.o:  amath583.hpp 

Note that in the second and third ways the production rule for the .o is more obvious: there is just one dependency on the rule that carries the recipe. Whenever any of the dependencies is out of date, the production rule will still fire, and since the .cpp dependency sits right on the rule with the recipe, it is more obvious what will get bound to $<.

There is one last thing to clean up here. We still have a repetitive pattern – all we did was substitute the magic macros in. But every production rule still has:

    something.o : something.cpp
            $(CXX) -c $(CXXFLAGS) $< -o $@

There is one last piece of pattern-matching that make has and that will let us roll all of those rules up into just one: implicit rules. We express an implicit rule like this:

    %.o : %.cpp 
            $(CXX) -c $(CXXFLAGS) $< -o $@

This is basically the rule with “something” above – but we only need to write this rule once and it covers all cases where something.o depends on something.cpp. We still have the other dependencies (on headers) to account for – but this can also be automated (more on that in the next assignment).

At any rate, the Makefile we started with now looks like this:

CXX      = c++
CXXFLAGS = -Wall -g -std=c++11 

# Implicit rule for building .o from .cpp
%.o : %.cpp 
         $(CXX) -c $(CXXFLAGS) $< -o $@

# Dependencies
amath583.o: amath583.hpp 
amath583.o: Vector.hpp 
norm.o: amath583.hpp 
norm.o: Vector.hpp 

This will handle all cases where something.o depends on something.cpp. Using these magic macros and implicit rules greatly simplifies what we have to put in a Makefile, and it also allows us to specify dependencies separately from production rules. (Note also that the dependency of, say, amath583.o on amath583.cpp is covered by the implicit rule, so we only need to explicitly specify the other dependencies, the header files.)

NB: We will still continue to supply Makefiles for most of your assignments, but we will gradually take some of the scaffolding away in future problem sets, and you will have to add rules for your own files. You can add those rules however you like; you do not have to use any of the advanced features of make in your assignments as long as your programs build correctly for the test script. However, these features were created to make the lives of programmers easier, so you will probably find it useful to familiarize yourself with them so that you can take advantage of them.

Warm Up

cpuinfo

As discussed in lecture, the SIMD/vector instruction sets for Intel architectures have been evolving over time. Although modern compilers can generate machine code for any of these architectures, the chip that a program is running on must support the instructions generated by the compiler. If the CPU fetches an instruction that it cannot execute, it will throw an illegal instruction error.

The machine instruction cpuid can be used to query the microprocessor about its capabilities. How to issue these queries and how to interpret the results is explained (e.g.) in the Wikipedia entry for cpuid. A small program for querying your machine’s CPU in this way is provided in “cpuinfo583.cpp”. The program interacts directly with the CPU using assembly-language instructions (in particular, the cpuid instruction).

Compile and run cpuinfo583.exe (there is a rule for it in your Makefile).

$ make cpuinfo583.exe
$ ./cpuinfo583.exe

You will get a listing that shows a selection of the capabilities available on your machine. The particular macros to look for are anything with “SSE”, “AVX”, “AVX2”, or “AVX512”. These support 128-, 256-, 256-, and 512-bit operands, respectively.

What level of SIMD/vector support does the CPU in your computer provide?

What is the maximum operand size that your computer will support?

Overheads

The defensive programming technique assert does a run-time check of program invariants, conditions that must always be true. We have seen some places where those checks could be really useful, such as making sure that matrix and vector dimensions are compatible before operating on them, or checking that indices into matrices and vectors are within bounds.
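For instance, a dimension check of the kind described above might look like the following sketch (assuming the Matrix interface used throughout this assignment; the actual assert statements are already present in the supplied files):

#include <cassert>

// Sketch: guard a matrix-matrix product with invariant checks.
// Assumes the Matrix class from this assignment is available.
void matmat(const Matrix& A, const Matrix& B, Matrix& C) {
  assert(A.num_cols() == B.num_rows());   // inner dimensions must agree
  assert(C.num_rows() == A.num_rows());   // result shape must match A
  assert(C.num_cols() == B.num_cols());   // result shape must match B
  // ... the multiplication itself ...
}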

The Matrix.cpp and amath583.cpp files supplied in this problem set include assert statements in these places. To see the effect of having assert in place, run the script “overhead.bash”. The script will compile a simple timing program that does matrix-matrix product and display the times for matrix sizes from 16 to 128. It adjusts the number of times it runs the matrix multiply so that the total time for each size is about one second (at least on my laptop).

The script runs this program with no optimization and with -O3. With no optimization the program is really slow, so it’s actually commented out in the script.

  1. Run the script overhead.bash and record the times reported.
  2. Add the directive #define NDEBUG at the top of overhead.cpp.
  3. Rerun the script and compare the times. You should see a marked improvement in execution time.

When creating a program we want to have at least two modes: development (or debug) and production. The former should be maximally defensive while the latter should be maximally optimized.

Now, one could disable assert by going to each source file and inserting or removing the directive when switching from one mode to the other. But for a large program, this is going to be tedious and error-prone. Instead, we want to pass the definition as a flag to the compiler. If we use the flag -DNDEBUG as part of the compilation command, the macro NDEBUG will be defined prior to the start of the compilation process. In fact, it is equivalent to having

#define NDEBUG

at the top of every file. Thus, we can turn NDEBUG on and off simply by changing how we compile our programs. Moreover, if we add -DNDEBUG to the OPTS flags in our Makefile, it will automatically be used in any rule that includes $(OPTS) – which, with our magic macros and implicit rules, is all of the rules.
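For example, with the implicit rule from the previous section, a production build of norm.o might then be issued as (assuming the optimization flags are folded into the compile line):

    c++ -c -Wall -g -std=c++11 -O3 -DNDEBUG norm.cpp -o norm.o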

It gets even better. Even though changing between modes is just a matter of editing some flags in a Makefile, that can still be tedious and error-prone. The make program allows you to set Makefile macros from the command line – so what flags get passed in can be scripted. In fact, if you examine overhead.bash, you will see these make commands:

make clean
make overhead.exe OPTS="-O0"

make clean
make overhead.exe OPTS="-O3"

In the first case, the macro OPTS is set to -O0 and in the second case it is set to -O3.

Add -DNDEBUG to the optimization flags in your Makefile.

Some Matrix Arithmetic

The files amath583.hpp and amath583.cpp respectively include declarations and skeletons for Frobenius_norm and operator-(). Fill these in to do the obvious things.
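For reference, the Frobenius norm of an $m \times n$ matrix $A$ is

    $\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}$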

With these two operations defined, it is possible to check if two matrices are (more or less) equal:

  Matrix A(N, N), B(N, N);
  double epsilon = Frobenius_norm(A - B);

If epsilon is suitably small, then we can say A and B are equal. However, the definition of “suitably small” is highly context-dependent. One approach is to take epsilon to be machine epsilon, scaled by the size of the matrix – and maybe multiplied by another factor of 10 just to be sure.
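A sketch of such a check, using the two operations you just wrote (the name almost_equal and the safety factor of 10 are illustrative, not part of the assignment):

#include <limits>
// Assumes amath583.hpp (Matrix, Frobenius_norm, operator-) is included.

bool almost_equal(const Matrix& A, const Matrix& B) {
  // machine epsilon, scaled by the matrix size, times a safety factor of 10
  double tolerance = 10.0 * A.num_rows() * A.num_cols() *
                     std::numeric_limits<double>::epsilon();
  return Frobenius_norm(A - B) < tolerance;
}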

Problems

Data Access Patterns Matter

The mathematical formula for matrix-matrix product is

    $C_{ij} = \sum_{k} A_{ik} B_{kj}$

which we realized using our Matrix class as three nested loops over i, j, and k.

In the function we have been considering, we have been using the “ijk” ordering, where i is the loop variable for the outer loop, j is the loop variable for the middle loop, and k is the loop variable for the inner loop:

void matmat(const Matrix& A, const Matrix& B, Matrix& C) {
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t j = 0; j < C.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}

However, the order in which we take the inner summation does not depend on i, j, or k; we can do the inner summation in any order. That is, we can nest the i, j, k loops in any order, giving us six variants of the matrix-matrix product algorithm. For instance, an “ikj” ordering would have i in the outer loop, k in the middle loop, and j in the inner loop.
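For concreteness, here is a sketch of the ikj nesting (assuming the same Matrix interface as matmat above); the other orderings are the analogous permutations of the three loops:

// Sketch of the ikj ordering: i outer, k middle, j inner.
void matmat_ikj(const Matrix& A, const Matrix& B, Matrix& C) {
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t k = 0; k < A.num_cols(); ++k) {
      for (size_t j = 0; j < C.num_cols(); ++j) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}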

Implement ikj and jki.

Using the function matmat in amath583.cpp as a pattern, create two new functions: matmat_ikj and matmat_jki. Make sure to put the declarations in amath583.hpp and the implementation in amath583.cpp. (There are no skeletons for these functions, but there are some comments about where to put the definitions and declarations in the files).

Verify that these functions each correctly compute the matrix-matrix product by building and running test_order.exe.

Once you have verified the correctness of your functions, build and run the program ordering.exe. You will need to insert calls to the functions you wrote in the appropriate places.

Answer the questions regarding this exercise in the file Questions.md.

Orientation

When mapping a two-dimensional (doubly-indexed) object such as a Matrix to linear memory in a straightforward way, there are two choices: row-major or column-major (cf. Lecture 5). Traditionally, C (and its derivatives) have stored multi-dimensional data in row-major order, whereas Fortran (and its derivatives) have stored multi-dimensional data in column-major order. I claim that for every ordering of a matrix-matrix product algorithm for a row-oriented matrix, there is an equivalent ordering (in terms of data access order, and thus performance) for a column-oriented matrix. To help carry out some experiments with this, you have been provided with two implementations: RowMatrix.hpp and ColMatrix.hpp. The RowMatrix class should be identical to the Matrix class we have been developing in class. The ColMatrix class was also briefly mentioned in lecture.

For each of the ijk, ikj, and jki orderings for a row-oriented matrix-matrix product, what are the equivalent orderings for a column-oriented matrix-matrix product? (Answer this question in Questions.md.)
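As a reminder of the two layouts (a sketch; the storage_ member and index arithmetic are as developed in lecture):

  // Row-major (RowMatrix):    element (i, j) lives at storage_[i * num_cols + j]
  // Column-major (ColMatrix): element (i, j) lives at storage_[i + j * num_rows]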

Pick one of ijk, ikj, or jki and implement the matrix-matrix product for it in the file amath583_rvc.hpp. You just need to implement it in the overload of matmat for RowMatrix. Then, implement the equivalent version for the ColMatrix overload. Compile and run the program row_vs_col.exe. The performance plots should be about the same.

Remark on Laziness

If you look at some of the code we have been implementing, a lot of it is very similar. However, since we have different types, we have to implement different functions. Look in particular at the two randomize overloads in amath583_rvc.hpp. Except for the types of the arguments, the functions are exactly the same. This is a horrible waste of time and effort, and over the long run it introduces opportunities for errors. The functions are the same even though the types are different because both types (RowMatrix and ColMatrix) have the same interface.

Imagine also if you wanted single precision and double precision matrices in addition to row and column oriented. (Or single and double precision complex matrices). If these all had the same interface, the code for each of these different data types would be exactly the same.

Just as we were able to reuse code by parameterizing on values, C++ allows us to further reuse code by parameterizing on types via its (famous? infamous?) template feature. More on that as we move through the quarter, but you should start developing an eye (and a distaste) for duplicated code.
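For a taste of where this is going, here is a minimal sketch of a single, type-parameterized randomize; the body is illustrative (not the code from amath583_rvc.hpp) and assumes only the common interface (num_rows, num_cols, operator()):

#include <cstddef>
#include <random>

// One function body serves any matrix type with the common interface.
template <class MatrixType>
void randomize(MatrixType& A) {
  std::default_random_engine generator;
  std::uniform_real_distribution<double> distribution(-1.0, 1.0);
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < A.num_cols(); ++j) {
      A(i, j) = distribution(generator);
    }
  }
}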

Do Data Access Patterns Matter? (AMATH 583 Only)

As with the matrix-matrix product, we would like to get better performance from the matrix-vector product than the naive algorithm provides. This exercise will investigate some potential mechanisms.

In analogy to the previous exercise, create two matrix-vector functions matvec_ij and matvec_ji, with declarations and definitions in amath583.hpp and amath583.cpp, respectively. The difference between these two functions is that the first should iterate over the vector (over j) in the inner loop, while the second should iterate over the vector (over j) in the outer loop.

Insert the calls to these functions in matvec.cpp and compile and run the program matvec.exe. The program will create a graph matvec.png. The default sizes for the matrix should be fine for this problem but you can pass the max size on the command line.

Again, using what you know about the memory hierarchy, SIMD/vectorization, CPU operation, and data layout for row-major matrices, explain the behavior you are seeing. (Answer this question in Questions.md.)

Vector of Vectors (Extra Credit)

There are other choices for creating a two-dimensional data structure than the index mapping we have done in Matrix (and in RowMatrix and ColMatrix). For example, we could create a one-dimensional array of one-dimensional arrays (or a std::vector of std::vectors):

  std::vector<std::vector<double>> storage_;

For this problem you are to complete the implementation of VoVMatrix, including defining the storage, the accessors into the storage, and the storage constructor. There is a big hint about how to define the storage right above. For the accessor, recall that the storage is a vector of vectors. If you index into the outer vector, the storage_ member (which is done using operator[]), you will get back a (reference to a) vector. You can then index into that. There is no need to “materialize” the intermediate vector, since it is a reference – simply concatenate the calls to operator[]. The constructor may take a little thinking, but recall that there is an overload of the std::vector constructor that takes an initial value. Since storage_ is a vector – which we want to contain another vector at each index – what should the initial “value” of each entry be?
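As a generic illustration of these hints (deliberately not the VoVMatrix code itself; num_rows and num_cols here are placeholder variables):

#include <cstddef>
#include <vector>

int main() {
  size_t num_rows = 4, num_cols = 4;
  // Construct: num_rows entries, each initialized to a vector of
  // num_cols doubles (the "initial value" constructor overload).
  std::vector<std::vector<double>> v(num_rows, std::vector<double>(num_cols, 0.0));
  // Access: the outer operator[] returns a reference to an inner vector,
  // so the calls can simply be concatenated.
  v[2][3] = 1.0;
  return 0;
}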

Next, define an overload for matmat in amath583_vov.hpp. Again, recall that the interface for VoVMatrix is the same as for Matrix.

Finally, complete the program in vov.cpp by inserting the appropriate function calls in the indicated places. Compile and run the program vov.exe.

Explain any differences between what you see for VoVMatrix and what you see for Matrix (in Questions.md).

As an extra bonus, experiment with different loop orderings.

Does VoVMatrix respond in the same way for different loop orderings as does Matrix? (In Questions.md)

Submitting Your Work

Do a make clean in your ps4 directory and then create a tar ball ps4-submit.tar.gz containing all of the files in that directory:

$ cd ps4
$ make clean
$ cd ..
$ tar zcvf ps4-submit.tar.gz --exclude "./ps4/.vscode" ./ps4

If you relied on any external references, list them in a file refs.txt and include that in your tarball as well (in the ps4 subdirectory).

