University of Washington  
Department of Applied Mathematics  
AMATH 483/583   Assigned: April 18, 2019
Problem Set #3   Due: April 25, 2019

Introduction

This assignment is extracted, as with the previous assignment, into a directory ps3. In that directory is a file Questions.md with short-answer questions that you will be asked in different parts of this assignment. To fill in your answers, open the file in VS Code and insert your answers directly into the file, using the format indicated.

The tar ball for this assignment can be downloaded with this link.

Preliminaries

main()

By convention (and long since standardized), the function that is first invoked when a C++ program is run is the function named main(). In lecture and in previous assignments, we have seen a very basic use of main()

#include <iostream>

int main() {
  std::cout << "Hello World!" << std::endl;
  return 0;
}

Returning from main()

We have noted a few important things about main().

First, main() is a function just like any other; the only thing that makes it special is its name – the function with that name is the one that will be called when your program is run. As “just a function”, main has a return type (int) and so must also return a value. As with any other function, this value is returned to that which called the function. In the case of main(), there isn’t another function in your program that called main(), so the value is returned to something else. Namely, it is returned to the program that invoked your program – typically the shell (bash).

But where is that value and how do you access it? If you examine the test script verify.bash, you will notice lines that look like this:

if [ $? == 0 ] ; then
    echo "+++ $1 Passed +++"
else
    echo "--- $1 Failed ---"
fi

These lines follow the execution of a program. The return value from the program – the return value from that program’s main() function – is accessed with the special shell macro $?.

In your ps3 directory there are two files: hello.cpp and goodbye.cpp. As supplied, they both return 0 from main(). There are rules to build hello.exe and goodbye.exe in your Makefile. Build both of them:

$ make hello.exe goodbye.exe 

Now run each of them and inspect the return value:

$ ./hello.exe 
$ echo $? 
$ ./goodbye.exe 
$ echo $? 

What is printed in response to the echo statements for each case?

Next, change goodbye.cpp so that main() returns a non-zero value (say, 1 or -1). Recompile goodbye.exe and re-run the two programs again:

$ make goodbye.exe 
$ ./hello.exe 
$ echo $? 
$ ./goodbye.exe 
$ echo $? 

What is returned in this case?

In the command-line world of Linux/Unix/etc., programs are often invoked by other programs (e.g., bash scripts), and the return values from executed programs are used to signal successful execution (return 0) or not (return a non-zero error code). Actual program output is generally written as ASCII strings to the standard output (in C++, std::cout) and read from the standard input (std::cin).
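This convention is visible directly at the shell prompt: bash's && and || operators test the return value of the preceding command. A small illustration (the directory names here are just examples):

```shell
# Runs the second command only when the first succeeds (returns 0)
ls /usr > /dev/null && echo "ls succeeded"

# Runs the second command only when the first fails (returns non-zero)
ls /nosuchdirectory 2> /dev/null || echo "ls failed"

# The same test written out with an explicit if, as in verify.bash
if ls /usr > /dev/null ; then
    echo "passed"
else
    echo "failed"
fi
```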

Calling main() with arguments

Command-line programs can also take parameters on the command line. For example, if you want to list the files in the directory "/usr", you would invoke

$ ls /usr

Here, ls is simply a program (by convention, Unix/Linux programs do not have a .exe extension). If you look in the /usr/bin directory, in fact, you will see the executable file for that command

$ ls /usr/bin/ls

Shell programs like bash operate in a 'read-eval-print loop' (REPL). The shell takes all of the characters typed on the line before you hit enter, parses them, executes the command given (the first word on the line), and prints the result. To execute the command, the shell searches the directories listed in the $PATH environment variable and runs the program if it is found.

You can see what your path is by printing the $PATH variable:

$ echo $PATH

Now, all of the programs that you might run (do an ls on, say, /usr/bin or any other directory in your path) have a main() function (assuming they are C or C++ programs, which most programs in Linux/Unix are). So they will return 0 or non-zero values (you can verify this by running some programs and checking $?)

$ ls /usr/bin
$ echo $?
$ ls /nosuchdirectory
$ echo $?
$ nosuchcommand
$ echo $?

But, these programs are doing more than what we have been doing so far: they take arguments. That is,

$ ls /usr/bin

lists the contents of the directory /usr/bin. So, somehow, “/usr/bin” is being passed into the program – in particular, somehow, “/usr/bin” is being passed to the main() function in the ls program.

The main() function we have seen to this point doesn’t take any arguments (so it isn’t obvious how argument passing would be done). However, main() is actually an overloaded function, and either of the two versions of main() will be called when your program is run. (You can only have a single main() – you must use one or the other of the overloads, but you can’t use both.)

The overload of main() that takes arguments is declared as follows:

int main(int argc, char *argv[]);

The first argument, traditionally named argc, contains the number of arguments that are passed to your main() function. The second argument, traditionally named argv, is an array of strings, each of which is one of the arguments to your program.

(Here is where we are forced to make a concession to the use of pointers. In C/C++, a pointer is the address in memory of a datum and is denoted with the * prefix.)

If we have our arguments in an array of strings and we know how many entries are in the array, we can loop through the array and read and interpret each entry. In your ps3 directory is a file repeat.cpp:

#include <iostream>

int main(int argc, char *argv[]) {

  for (size_t i = 0; i < argc; ++i) {
    std::cout << "argv[" << i << "]: " << argv[i] << std::endl;
  }

  return 0;
}

Make repeat.exe and try the following. First, execute the program with no arguments at all:

$ ./repeat.exe 

What is printed?

Now execute the program with whatever arguments you like, e.g.,

$ ./repeat.exe whatever arguments you like

What is printed?

Finally, note that “.” is your current directory. We specify “./” in front of the programs we compile locally because the shell will not find them by searching its path. That is,

$ repeat.exe

will not find repeat.exe.

But the directory where repeat.exe exists is part of the file system directory structure on your computer, and that location can be specified in ways other than with “.”. If you run the command

$ pwd

bash will print your current working directory. The current working directory is also stored in the environment variable PWD. So, for instance, in my own case:

$ echo $PWD
/mnt/c/Users/al75/amath583/ps3

(your working directory may be different).

Instead of using “.” (which is a relative path), use the value of $PWD (which is an absolute path):

$ /mnt/c/Users/al75/amath583/ps3/repeat.exe

$ /mnt/c/Users/al75/amath583/ps3/repeat.exe some arguments or other

(You will need to specify the correct path for your system.)

What is printed?

Warm Up

As we develop more sophisticated programs through the rest of this quarter, we will need to pass arguments to main() – and to interpret those arguments in different ways. For example, if we want to pass in the size of a problem to run, we would put a number on the command line

$ benchmark.exe 1024

In this case “1024” would be passed to the main() function in this program. But note that it is the string “1024”, not the number 1024. For example, what happens if you try to compile the following program (main.cpp in your ps3 directory):

#include <iostream>

int main(int argc, char* argv[])  {
  int size = argv[1];
  std::cout << "size is " << size << std::endl;

  return  0;
}

To set the value of size to be a number, we have to convert (“parse”) the string in argv[1] into an integer and assign the resulting value to size.

Modify main.cpp so that it correctly assigns an integer value to size. You may find the functions atoi(), atol(), stol(), or similar to be useful. If you do end up using a library function to do the conversion from a character string to a number, you will also need to include the appropriate header file (the header file containing the declaration for that function).

There is another problem with main.cpp. Once you have made your modification, what happens if you don’t pass any argument at all to your program?

$ ./main.exe

We need to do a little bit of error checking. One can do arbitrary types of checking and error handling, but we can do a few simple things that will serve well for this course. The simplest, which is what we will do for this problem set, is to simply make sure we have the right number of arguments.

#include <iostream>

int main(int argc, char* argv[])  {
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " size" << std::endl;
    return -1;
  }
  int size = argv[1];  // FIXME
  std::cout << "size is " << size << std::endl;

  return  0;
}

Until we introduce some more sophisticated approaches, you should use an approach like this for programs you write that take command-line arguments. Grading scripts for such programs will check that you handle the wrong number of command-line arguments correctly. For now we assume that the arguments themselves are well-formed.

Timing

As discussed in lecture – if we are going to achieve “high performance computing” we need to be able to measure performance. Performance is the ratio of how much work is done in how much time – and so we need to measure (or calculate) both to quantitatively characterize performance.

To measure time, in lecture we also introduced a class (also available as Timer.hpp in the example code directory).

class Timer {
private:
  typedef std::chrono::time_point<std::chrono::system_clock> time_t;

public:
  Timer() : startTime(), stopTime() {}

  time_t start()   { return (startTime = std::chrono::system_clock::now()); }
  time_t stop()    { return (stopTime  = std::chrono::system_clock::now()); }
  double elapsed() { return std::chrono::duration_cast<std::chrono::milliseconds>(stopTime-startTime).count(); }

private:
  time_t startTime, stopTime;
};

To use this timer, you just need to include it (with the preprocessor) in any file where you want to use it.
To start timing, invoke the start() member function; to stop the timer, invoke the stop() member function. The elapsed time (in milliseconds) between the start and stop is reported with elapsed().

To practice using the timer class, compile the program timing.exe:

#include <iostream>
#include "Timer.hpp"

int main(int argc, char* argv[]) {
  size_t loops = argv[1]; // FIXME -- note size_t not int!!

  Timer T;
  T.start();
  for (size_t i = 0; i < loops; ++i)
    ;
  T.stop();

  std::cout << loops << " loops took " << T.elapsed() << " milliseconds" << std::endl;

  return 0;
}

First, to get a baseline, compile it with no optimization at all – the target is “timing0.exe” in your Makefile. How many loops do you need to run to get an elapsed time of (about) 1 second? How fast is each loop being executed?
Note that the empty statement “;” in the loop body just means “do nothing.”

Second, let’s look at how much optimizing this program will help. Compile the same program as before, but this time use the -O3 option (timing3.exe). How much did your program speed up? If it executes too quickly, again, try increasing the loop count. How much time per iteration is spent now? Does this make sense?

If you are unsure about the answer you are getting here, start a discussion on Piazza. Try to have this discussion sooner rather than later, as you will need some of the information gained for later in this assignment.

Abstraction Penalty and Efficiency

One question that arises as we continue to optimize, e.g., matrix multiply is: how much performance is available? The performance gains we saw in class were impressive, but are we doing well in an absolute sense? To flip the question around, and perhaps make it more specific: we are using a fairly deep set of abstractions to give ourselves notational convenience. That is, rather than computing linear offsets from a pointer directly to access memory, we are invoking a member function of a class (recall that a member function is just a function). From that function we are invoking another function in the class, and there may even be more levels of indirection underneath that.

Calling a function involves a number of operations: saving return addresses on the stack, saving parameters on the stack, jumping to a new program location – and then unwinding all of that when the function call has completed. When we were analyzing matrix-matrix product in lecture, we were assuming that the inner loop just involved a small number of memory accesses and floating point operations. We didn’t consider the cost we might pay for all of those function calls – calls we could be making at every iteration of the multiply function. If we were making those calls – or doing anything else extraneous – we would also be measuring them when timing the multiply function. And, obviously, we would be giving up performance. The performance loss due to the use of programming abstractions is called the abstraction penalty.

One can measure the difference between achieved performance and maximum possible performance as a ratio – as efficiency. Efficiency is simply

$\text{efficiency} = \dfrac{\text{achieved performance}}{\text{maximum possible performance}}$

Measuring maximum performance

Let’s write a short program to measure maximum performance – or at least measure the performance of the multiply-add at the heart of matrix-matrix product, but without any abstractions in the way.

  size_t loops = argv[1]; // FIXME
  double a = 3.14, b = 3.14159, c = 0.0;
  Timer T;
  T.start();
  for (size_t i = 0; i < loops; ++i) {
    c += a * b;
  }
  T.stop();

To save space, I am just including the timing portion of the program. In this loop we are multiplying two doubles and adding the result to a double. This multiply-add is exactly the operation we would do $N^3$ times in a matrix-matrix multiply.

Time this loop with and without optimization (efficiency0.exe and efficiency3.exe). What happens? How can this be fixed? Again, this is a question I would like the class to discuss as a whole and arrive at a solution.

Exercises

Floats and Doubles

Most microprocessors today support (at least) two kinds of floating point numbers: single precision (float) and double precision (double). In the (distant) past, arithmetic operations using float were more efficient than those using double. Moreover, the amount of chip space that had to be devoted to computing with float was smaller than with double. Today these differences in computation time are less pronounced (or non-existent); an arithmetic operation with a float takes the same amount of time as an arithmetic operation with a double.

The IEEE 754 floating point standard specifies how many bits are used for representing floats and doubles, how they are each represented, what the semantics are for various operations, and what to do when exceptions occur (divide by zero, overflow, denormalization, etc.). This standard, and the hardware that supports it (notably the original 8087 chip), is one of the most significant achievements in numerical computing. (William Kahan won the Turing Award for his work on floating point numbers.)

Since a double has so much better precision than a float, and since the computation costs are a wash, why would one ever use a float these days? Interestingly, 16-bit floating point numbers are starting to be available on many architectures in order to support machine learning. Again, if the computational cost is the same, why would one ever give up that much precision?

In this exercise we are going to investigate some operations with floats and doubles and see what may or may not still be true (and hypothesize why).

The program float_vs_double.cpp has two sections, one for single precision and one for double precision. The functionality of each section is the same; only the datatypes are different. Each section times two things: the time it takes to allocate and initialize three arrays and the time it takes to multiply two of them and set the third to the result.

The section for double looks like this, for instance (with the versions for float being very similar):

  {
    size_t dim = argv[1]; // FIXME
    Timer t;    t.start();
    std::vector<double> x(dim, 3.14);
    std::vector<double> y(dim, 27.0);
    std::vector<double> z(dim, 0.0);
    t.stop();
    std::cout << "Construction time: " << t.elapsed() << std::endl;

    t.start();
    for (size_t i = 0; i < dim; ++i)
      z[i] = x[i] * y[i];
    t.stop();
    std::cout << "Multiplication time: " << t.elapsed() << std::endl;
  }

To Do For this exercise, generate four sets of numbers by running float_vs_double0.exe and float_vs_double3.exe: single precision with optimization off, double precision with optimization off, single precision with optimization on, and double precision with optimization on. Run the programs with a variety of problem sizes that give you times between 100ms and 1s (more or less).

Questions Answer the questions regarding this exercise in the file Questions.md.

Your explanations should refer to the simple machine model for single core CPUs with hierarchical memory that we developed in lecture 6.

Efficiency

For this problem we want to investigate how to properly measure a loop with just a multiply and an add in the inner loop. In particular, we want to see if that can tell us the maximum FLOPS you can get on one core. We would like to relate that to the clock rate of the CPU on your computer and estimate how many floating point operations are being done on each clock cycle. See the paragraph marked “Measuring maximum performance” above.

Once you have a meaningful answer for the maximum floating point rate of your computer, we can use this result as the denominator in computing efficiency (fraction of peak).

To Do Modify the program in float_vs_double.cpp to give you a more sensible answer.

Questions Answer the questions regarding this exercise in the file Questions.md.

Note that since we are interested in maximum efficiency, we are interested in the rate possible with fully optimized code. NB: Give some thought to the numbers that you generate and make sure they make sense to you. Use double precision for your experiments.

When we defined operator+ in lecture, we saw that it allocates and returns a Vector.

  Vector operator+(const Vector& x, const Vector& y) {
  Vector z(x.num_rows());
  for (size_t i = 0; i < x.num_rows(); ++i) {
    z(i) = x(i) + y(i);
  }
  return z;
}

One common operation in numerical linear algebra has the form $y = y + \alpha x$, where $x$ and $y$ are vectors and $\alpha$ is a scalar. Let’s consider a simpler case, where there is no $\alpha$, so we just have $y = y + x$. If we write that operation using our class, we would write

Vector x(1024), y(1024);
// fill vectors, compute, etc
y = y + x;

Now, if we expand the operators we see that we are (essentially) doing

Vector z(x.num_rows());
for (size_t i = 0; i < z.num_rows(); ++i) { // operator+ expanded
  z(i) = x(i) + y(i);
}
for (size_t i = 0; i < z.num_rows(); ++i) { // operator= expanded
  y(i) = z(i);
}

But this isn’t what we want at all. If you were just going to write your own loop for $y = y + x$, you would just write

for (size_t i = 0; i < y.num_rows(); ++i)
  y(i) = y(i) + x(i);

The case with operator+ in this and subsequent assignments requires allocating a new Vector as well as making two passes over the data (one to compute z, and another to copy z into y).

If you look at the instruction set for a modern microprocessor, the addition operation that it carries out in its registers is typically of the form $y = y + x$, as a single instruction. To help compilers pick that instruction rather than creating a temporary (as we saw above in the Vector example), C (and C++) have special assignment operators. In particular, instead of writing y = y + x, one can simply write y += x. Using this operator helped simpler compilers generate the right instructions for what the programmer intended – and avoided making unnecessary copies. Modern compilers are more powerful and can usually deduce that they should compile y = y + x to the equivalent of y += x.

That’s fine for built-in types; the compiler knows about those. But what about user-defined types? For the example above, even a very powerful compiler would not be able to turn the expression y = y + x into the second, more efficient, loop. Here is where the rich set of assignment operators can come to the rescue. That is, we can define operator+= for Vector and obtain the more efficient loop construct.
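To illustrate the pattern (using a minimal stand-in class here, not the actual amath583 Vector), such an overload could be sketched as:

```cpp
#include <cstddef>
#include <vector>

// Minimal stand-in for the lecture's Vector class (a sketch only),
// with num_rows() and element access via operator().
class Vector {
public:
  explicit Vector(std::size_t n) : storage_(n, 0.0) {}
  std::size_t num_rows() const { return storage_.size(); }
  double&       operator()(std::size_t i)       { return storage_[i]; }
  const double& operator()(std::size_t i) const { return storage_[i]; }
private:
  std::vector<double> storage_;
};

// operator+= updates y in place: one loop, no temporary Vector allocated.
Vector& operator+=(Vector& y, const Vector& x) {
  for (std::size_t i = 0; i < y.num_rows(); ++i) {
    y(i) += x(i);
  }
  return y;
}
```

Note the single loop and the absence of any temporary Vector – this is exactly the hand-written loop above, packaged as an operator.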

To Do Implement operator+= for Vector. There is already a declaration in amath583.hpp and the shell of an implementation in amath583.cpp (fill in the “WRITE ME” portion).

The test program plus_equals.cpp verifies that using operator+= is as efficient as the hand-written loop. The program takes a first command-line argument specifying the size of the Vector that will be created and a second that specifies how many times the loop will be run between the start and stop of a timer. The program then prints

Hand written: <time>
y = y + x: <time>
Operator+=: <time>

which gives times for the hand-written version, the y = y + x version, and the y += x version. The printed times are per trip (note the division by trips in the code).

operator* for scalar times vector (583 ONLY)

For this exercise, complete the implementation of operator*, which multiplies a Vector by a scalar and returns a new Vector that is the result of that product. The declaration and definition are in amath583.hpp and amath583.cpp, respectively.

To Do As above, but with alpha * x in place of x (and alpha * x(i) in place of x(i) in the hand-written loops).

Questions Answer the questions regarding this exercise in the file Questions.md.

Turning in the Exercises

Do a make clean in your ps3 directory and then create a tar ball ps3-submit.tar.gz containing all of the files in that directory. From the directory above ps3:

$ cd ps3
$ make clean
$ cd ..
$ tar zcvf ps3-submit.tar.gz ./ps3

If you relied on any external references, list them in a file ref3.txt and include that in your tarball as well (in the ps3 subdirectory).

