University of Washington
Department of Applied Mathematics
AMATH 483/583
Problem Set #3
Assigned: April 18, 2019
Due: April 25, 2019
This assignment is extracted, as with the previous assignment, into a directory ps3. In that directory is a file Questions.md with short-answer questions that you will be asked in different parts of this assignment. To fill in your answers, open the file in vscode and insert your answers directly into the file, using the format indicated.
The tar ball for this assignment can be downloaded with this link.
By convention (and long since standardized), the function that is first invoked when a C++ program is run is the function named main(). In lecture and in previous assignments, we have seen a very basic use of main():
#include <iostream>

int main() {
  std::cout << "Hello World!" << std::endl;
  return 0;
}
We have noted a few important things about main(). First, main() is a function just like any other; the only thing that makes it special is its name – the function with that name is the one that will be called when your program is run. As "just a function", main() has a return type (int) and so must also return a value. As with any other function, this value is returned to whatever called the function. In the case of main(), there isn't another function in your program that called main(), so the value is returned to something else: namely, to the program that invoked your program – typically the shell (bash).
But where is that value and how do you access it? If you examine the test script verify.bash, you will notice lines that look like this:
if [ $? == 0 ] ; then
echo "+++ $1 Passed +++"
else
echo "--- $1 Failed ---"
fi
These lines follow the execution of a program. The return value from the program – the return value from that program's main() function – is accessed with the special shell parameter $?.
In your ps3 directory there are two files: hello.cpp and goodbye.cpp. As supplied, they both return 0 from main(). There are rules to build hello.exe and goodbye.exe in your Makefile. Build both of them:
$ make hello.exe goodbye.exe
Now run each of them and inspect the return value:
$ ./hello.exe
$ echo $?
$ ./goodbye.exe
$ echo $?
What is printed in response to the echo statements for each case?
Next, change goodbye.cpp so that main() returns a non-zero value (say, 1 or -1). Recompile goodbye.exe and re-run the two programs:
$ make goodbye.exe
$ ./hello.exe
$ echo $?
$ ./goodbye.exe
$ echo $?
What is returned in this case?
In the command-line world of Linux/Unix/etc., programs are often invoked by other programs (e.g., bash scripts), and the return values from executing programs are used to signal successful execution (return 0) or not (return a non-zero error code). Actual program output is generally written as ASCII strings to the standard output stream (in C++, std::cout) and read from the standard input stream (std::cin).
Command-line programs also can take parameters on the command line. For example, if you want to list the files in the directory "/usr", you would invoke
$ ls /usr
Here, ls is simply a program (by convention, Unix/Linux programs do not have a .exe extension). If you look in the /usr/bin directory, in fact, you will see the executable file for that command
$ ls /usr/bin/ls
Shell programs like bash operate in a 'read-eval-print loop' (REPL). The shell takes all of the characters typed on the line before you hit enter, parses them, executes the command given (the first token on the line), and prints the result. To execute the command, the shell searches the directories listed in the $PATH environment variable and runs the program if it is found.
You can see what your path is by printing the $PATH variable:
$ echo $PATH
Now, all of the programs that you might run (do an ls on, say, /usr/bin or any other directory in your path) have a main() function (assuming they are C or C++ programs, which most programs in Linux/Unix are). So they will return 0 or non-zero values (you can verify this by running some programs and checking $?):
$ ls /usr/bin
$ echo $?
$ ls /nosuchdirectory
$ echo $?
$ nosuchcommand
$ echo $?
But, these programs are doing more than what we have been doing so far: they take arguments. That is,
$ ls /usr/bin
lists the contents of the directory /usr/bin. So, somehow, "/usr/bin" is being passed into the program – in particular, to the main() function in the ls program.
The main() function we have seen to this point doesn't have any arguments (so it isn't obvious how argument passing would be done). However, main() is actually an overloaded function, and either of the two versions of main() will be called when your program is run. (You can only have a single main() – you must use one or the other of the overloads, but you can't use both.)
The overload of main() that takes arguments is declared thusly:
int main(int argc, char *argv[]);
The first argument, traditionally named argc, contains the number of arguments that are passed to your main() function. The second argument is an array of strings, each of which is one of the arguments to your program.
(Here is where we are forced to make a concession to the use of pointers. In C/C++, a pointer is the address in memory of a datum and is denoted with the * prefix.)
If we have our arguments in an array of strings and we know how many entries are in the array, we can loop through the array and read and interpret each entry. In your ps3 directory is a file repeat.cpp:
#include <iostream>

int main(int argc, char* argv[]) {
  for (int i = 0; i < argc; ++i) {
    std::cout << "argv[" << i << "]: " << argv[i] << std::endl;
  }
  return 0;
}
Make repeat.exe and try the following. First, execute the program with no arguments at all:
$ ./repeat.exe
What is printed?
Now execute the program with whatever arguments you like, e.g.,
$ ./repeat.exe whatever arguments you like
What is printed?
Finally, note that "." is your current directory. We specify "./" in front of the programs we compile locally because the shell will not find them by searching its path. That is,
$ repeat.exe
will not find repeat.exe.
But, the directory where repeat.exe exists is part of the file system directory structure on your computer and that location can be specified other than with “.”. If you run the command
$ pwd
bash will print your current working directory. The current working directory is also stored in the environment variable PWD. So, for instance, in my own case, echoing it might print this:
$ echo $PWD
/mnt/c/Users/al75/amath583/ps3
(your working directory may be different).
Instead of using "." (which is a relative path), use the value of $PWD (which is an absolute path):
$ /mnt/c/Users/al75/amath583/ps3/repeat.exe
$ /mnt/c/Users/al75/amath583/ps3/repeat.exe some arguments or other
(You will need to specify the correct path for your system.)
What is printed?
As we develop more sophisticated programs through the rest of this quarter, we will need to pass arguments to main() – and to interpret those arguments in different ways. For example, if we want to pass in the size of a problem to run, we would put a number on the command line:
$ benchmark.exe 1024
In this case "1024" would be passed to the main() function in this program. But note that it is the string "1024", not the number 1024. For example, what happens if you try to compile the following program (main.cpp in your ps3 directory)?
#include <iostream>

int main(int argc, char* argv[]) {
  int size = argv[1];
  std::cout << "size is " << size << std::endl;
  return 0;
}
To set the value of size to be a number, we have to convert ("parse") the string in argv[1] into an integer and assign the resulting value to size.
Modify main.cpp so that it correctly assigns an integer value to size. You may find the functions atoi(), atol(), stol(), or similar to be useful. If you do end up using a library function to do the conversion from a character string to a number, you will also need to include the appropriate header file (the header file containing the declaration for that function).
There is another problem with main.cpp. Once you have made your modification, what happens if you don’t pass any argument at all to your program?
$ ./main.exe
We need to do a little bit of error checking. One can do arbitrary types of checking and error handling, but we can do a few simple things that will serve well for this course. The simplest, which is what we will do for this problem set, is to simply make sure we have the right number of arguments.
#include <iostream>

int main(int argc, char* argv[]) {
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " size" << std::endl;
    return -1;
  }

  int size = argv[1]; // FIXME
  std::cout << "size is " << size << std::endl;

  return 0;
}
Until we introduce some more sophisticated approaches, you should use an approach like this for programs you write that take command-line arguments. Grading scripts for such programs will check that you handle the wrong number of command-line arguments correctly. For now we assume that the arguments themselves are well-formed.
As discussed in lecture – if we are going to achieve “high performance computing” we need to be able to measure performance. Performance is the ratio of how much work is done in how much time – and so we need to measure (or calculate) both to quantitatively characterize performance.
To measure time, in lecture we also introduced a timer class (also available as Timer.hpp in the example code directory).
#include <chrono>

class Timer {
private:
  typedef std::chrono::time_point<std::chrono::system_clock> time_t;

public:
  Timer() : startTime(), stopTime() {}
  time_t start() { return (startTime = std::chrono::system_clock::now()); }
  time_t stop() { return (stopTime = std::chrono::system_clock::now()); }
  double elapsed() {
    return std::chrono::duration_cast<std::chrono::milliseconds>(stopTime - startTime).count();
  }

private:
  time_t startTime, stopTime;
};
To use this timer, you just need to include it (with the preprocessor) in any file where you want to use it. To start timing, invoke the start() member function; to stop the timer, invoke the stop() member function. The elapsed time (in milliseconds) between start and stop is reported with elapsed().
To practice using the timer class, compile the program timing.exe:
#include <iostream>
#include "Timer.hpp"

int main(int argc, char* argv[]) {
  size_t loops = argv[1]; // FIXME -- note size_t, not int!!

  Timer T;
  T.start();
  for (size_t i = 0; i < loops; ++i)
    ;
  T.stop();

  std::cout << loops << " loops took " << T.elapsed() << " milliseconds" << std::endl;

  return 0;
}
First, to get a baseline, compile it with no optimization at all – the target is
“timing0.exe” in your Makefile. How many loops do you need to run to get an elapsed
time of (about) 1 second? How fast is each loop being executed?
Note that the empty statement “;” in the loop body just means “do
nothing.”
Second, let’s look at how much optimizing this program will help. Compile the same program as before, but this time use the -O3 option (timing3.exe). How much did your program speed up? If it executes too quickly, again, try increasing the loop count. How much time per iteration is spent now? Does this make sense?
If you are unsure about the answer you are getting here, start a discussion on Piazza. Try to have this discussion sooner rather than later, as you will need some of the information gained for later in this assignment.
One question that arises as we continue to optimize, e.g., matrix multiply is: how much performance is available? The performance gains we saw in class were impressive, but are we doing well in an absolute sense?

To flip the question around, and perhaps make it more specific: we are using a fairly deep set of abstractions to give ourselves notational convenience. That is, rather than computing linear offsets from a pointer directly to access memory, we are invoking a member function of a class (recall that a member function is just a function). From that function we are invoking another function in the class, and there may be even more levels of indirection underneath that. Calling a function involves a number of operations: saving return addresses on the stack, saving parameters on the stack, jumping to a new program location – and then unwinding all of that when the function call has completed.

When we were analyzing matrix-matrix product in lecture, we assumed that the inner loop involved just a small number of memory accesses and floating point operations. We didn't consider the cost we might pay for all of those function calls – calls we could be making at every iteration of the multiply function. If we were making those calls – or doing anything else extraneous – we would also be measuring them when timing the multiply function. And, obviously, we would be giving up performance. The performance loss due to the use of programming abstractions is called the abstraction penalty.
One can measure the difference between achieved performance and maximum possible performance as a ratio – as efficiency. Efficiency is simply

$$\text{efficiency} = \frac{\text{achieved performance}}{\text{peak performance}}$$

Let's write a short program to measure maximum performance – or at least to measure the performance of the multiply-add at the heart of matrix-matrix product, without any abstractions in the way.
size_t loops = argv[1]; // FIXME
double a = 3.14, b = 3.14159, c = 0.0;

Timer T;
T.start();
for (size_t i = 0; i < loops; ++i) {
  c += a * b;
}
T.stop();
To save space, I am just including the timing portion of the program. In this loop we are multiplying two doubles and adding the result to a double, loops times – exactly the multiply-add that a matrix-matrix multiply executes $N^3$ times.
Time this loop with and without optimization (efficiency0.exe and efficiency3.exe). What happens? How can this be fixed? Again, this is a question I would like the class to discuss as a whole and arrive at a solution.
Most microprocessors today support (at least) two kinds of floating point numbers: single precision (float) and double precision (double). In the (distant) past, arithmetic operations using float were more efficient than with double. Moreover, the amount of chip space that had to be devoted to computing with float was smaller than with double. Today these differences in computation time are less pronounced (or non-existent); an arithmetic operation with a float takes the same amount of time as an arithmetic operation with a double.
The IEEE 754 floating point standard specifies how many bits are used for representing floats and doubles, how they are each represented, what the semantics are for various operations, and what to do when exceptions occur (divide by zero, overflow, denormalization, etc.). This standard, and the hardware that supports it (notably the original 8087 chip), is one of the most significant achievements in numerical computing. (William Kahan won the Turing Award for his work on floating point numbers.)
Since a double has so much better precision than a float, and since the computation costs are a wash, why would one ever use a float these days? Interestingly, 16-bit floating point numbers are starting to become available on many architectures in order to support machine learning. Again, if the computational cost is the same, why would one ever give up that much precision?
In this exercise we are going to investigate some operations with float and double and see what may or may not still be true (and hypothesize why).
The program float_vs_double.cpp has two sections, one for single precision and one for double precision. The functionality of each section is the same; only the datatypes are different. Each section times two things: the time it takes to allocate and initialize three arrays, and the time it takes to multiply two of them and set the third to the result. The section for double looks like this, for instance (with the version for float being very similar):
{
  size_t dim = argv[1]; // FIXME

  Timer t;
  t.start();
  std::vector<double> x(dim, 3.14);
  std::vector<double> y(dim, 27.0);
  std::vector<double> z(dim, 0.0);
  t.stop();
  std::cout << "Construction time: " << t.elapsed() << std::endl;

  t.start();
  for (size_t i = 0; i < dim; ++i)
    z[i] = x[i] * y[i];
  t.stop();
  std::cout << "Multiplication time: " << t.elapsed() << std::endl;
}
To Do
For this exercise, generate four sets of numbers by running float_vs_double0.exe and float_vs_double3.exe: single precision with optimization off, double precision with optimization off, single precision with optimization on, and double precision with optimization on. Run the programs with a variety of problem sizes that give you times between 100 ms and 1 s (more or less).
Questions
Answer the questions regarding this exercise in the file Questions.md.
Your explanations should refer to the simple machine model for single core CPUs with hierarchical memory that we developed in lecture 6.
For this problem we want to investigate how to properly measure a loop with just a multiply and an add in the inner loop. In particular, we want to see if that can tell us the maximum FLOP rate you can achieve on one core. We would like to relate that to the clock rate of the CPU on your computer and estimate how many floating point operations are being done on each clock cycle. See the paragraph marked “Measuring maximum performance” above.
Once you have a meaningful answer for the maximum floating point rate of your computer, we can use this result as the denominator in computing efficiency (fraction of peak).
To Do
Modify the program in float_vs_double.cpp to give you a more sensible answer.
Questions
Answer the questions regarding this exercise in the file Questions.md.
Note that since we are interested in maximum efficiency, we are interested in the rate possible with fully optimized code. NB: Give some thought to the numbers that you generate and make sure they make sense to you. Use double precision for your experiments.
When we defined operator+ in lecture, we saw that it allocates and returns a Vector.
Vector operator+(const Vector& x, const Vector& y) {
  Vector z(x.num_rows());
  for (size_t i = 0; i < x.num_rows(); ++i) {
    z(i) = x(i) + y(i);
  }
  return z;
}
One common operation in numerical linear algebra has the form $y = y + \alpha x$ where $x$ and $y$ are vectors and $\alpha$ is a scalar. Let’s consider a simpler case, where there is no $\alpha$, so we just have $y = y + x$. If we write that operation using our class, we would write
Vector x(1024), y(1024);
// fill vectors, compute, etc
y = y + x;
Now, if we expand the operators we see that we are (essentially) doing
Vector z(x.num_rows());
for (size_t i = 0; i < z.num_rows(); ++i) { // operator+ expanded
  z(i) = x(i) + y(i);
}
for (size_t i = 0; i < z.num_rows(); ++i) { // operator= expanded
  y(i) = z(i);
}
But this isn’t what we want at all. If you were just going to write your own loop for $y = y + x$, you would just write
for (size_t i = 0; i < y.num_rows(); ++i)
  y(i) = y(i) + x(i);
The version with operator+, in this and subsequent assignments, requires allocating a new Vector as well as looping twice, over z and then over y.
If you look at the instruction set for a modern microprocessor, the addition operation that it carries out in its registers is typically of the form $y = y + x$, as a single instruction. To help compilers pick that instruction rather than creating a temporary (as we saw above in the Vector example), C (and C++) have special assignment operators. In particular, instead of writing y = y + x, one can simply write y += x. Using this operator helped simpler compilers generate the right instructions for what the programmer intended – and avoided making unnecessary copies. Modern compilers are more powerful and can usually deduce that they should compile y = y + x to the equivalent of y += x.
That’s fine for built-in types; the compiler knows about those. But what about user-defined types? For the example above, even a very powerful compiler would not be able to turn the expression y = y + x into the second, more efficient, loop. Here is where the rich set of assignment operators can come to the rescue. That is, we can define operator+= for Vector and obtain the more efficient loop construct.
To Do
Implement operator+= for Vector. There is already a declaration in amath583.hpp and the shell of an implementation in amath583.cpp (fill in the “WRITE ME” portion). The test program plus_equals.cpp is provided to verify that using operator+= is as efficient as the hand-written loop.
The program takes a first command-line argument specifying the size of the Vector and a second that specifies how many times the loop will be run between the start and stop of a timer. The program then prints
Hand written: <time>
y = y + x: <time>
Operator+=: <time>
which gives times for the hand-written version, the y = y + x version, and the y += x version. The printed times are per trip (note the division by trips in the code).
For this exercise, complete the implementation of operator*, which applies a scalar to a Vector and returns a new Vector that is the result of that product. The declaration and definition are in amath583.hpp and amath583.cpp, respectively.
To Do
As above, but with alpha * x in place of x (and alpha * x(i) in place of x(i) in the hand-written loops).
Questions
Answer the questions regarding this exercise in the file Questions.md.
Do a make clean in your ps3 directory and then create a tar ball ps3-submit.tar.gz containing all of the files in that directory. From the directory containing ps3:
$ cd ps3
$ make clean
$ cd ..
$ tar zcvf ps3-submit.tar.gz ./ps3
If you relied on any external references, list them in a file ref3.txt and include that in your tarball as well (in the ps3 subdirectory).