University of Washington
Department of Applied Mathematics
AMATH 483/583
Problem Set #6
Assigned: May 23, 2019
Due: May 30, 2019
As with previous assignments, this one extracts into a directory ps6. In that directory is a file Questions.md with short-answer questions that you will be asked in different parts of this assignment. To fill in your answers, open the file in vscode and insert your answers directly into the file, using the format indicated.
The tar ball for this assignment can be downloaded with this link.
Scaling tests for the last three assignments in the course ( PS6 , PS7 , and the Final ) will be done using a server instance on AWS. Instructions for accessing AWS can be found here.
These machines will have substantially more cores (16 to 32) than your laptop and can therefore support better scaling studies. Note however that these machines are shared resources. For PS6 and PS7 you should be able to complete the assignments on a shared instance. But, be considerate of your fellow students and try to limit the size and duration of your runs while you are testing and debugging.
NB: If you have access to another machine with 16 or so cores, you may use that instead of AWS.
On the AWS instance there is a public directory “/nfs/home/public” where I will put the assignments as well as auxiliary data for assignments. Some of the data files we will be using are fairly large. I recommend not copying them to your home directory, but rather reading them directly from the public directory.
For example, if you were running the program scale.exe and wanted to use the file cit-Patents.mmio from the public directory, you could run
$ ./scale.exe -i /nfs/home/public/ps6/data/cit-Patents.mmio
assuming the above path is where the file lives and that scale.exe takes the -i option. The point is to access large files directly from the public directory rather than copying them.
If you do want to just use a local name for the file rather than the full path, you can create a symbolic link:
$ ln -s /nfs/home/public/ps6/data/cit-Patents.mmio .
A symbolic link is a reference to the actual file rather than a copy, and can be accessed and treated as if it were the actual file.
$ ./scale.exe -i cit-Patents.mmio
Besides some specific files, the public data directory contains two sets of files; the files within each set have the same structure but different scales. The “Laplace” files are sparse matrices from a discretized 2D Laplace’s equation, and the “graph500” files are synthetically generated sparse matrices that supposedly mimic social network behavior.
The files are in binary format so that they can be loaded into your programs much more quickly than text-based files. If you want to see a little bit of information about each file (rows, columns, number of nonzeros), the first two lines are in text:
$ head -2 cit-Patents.mmio
The numbers in the second line are, respectively, the number of rows, the number of columns, and the number of nonzeros.
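If you would rather read those numbers from inside a program, here is a minimal sketch (a hypothetical helper, not part of the assignment code) that parses just the two text header lines:
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical sketch: print the dimensions recorded in the two text header
// lines of an .mmio file.  Use the assignment's own I/O routines to actually
// load the matrices.
int main(int argc, char* argv[]) {
  if (argc < 2) return 1;
  std::ifstream in(argv[1]);
  std::string banner;
  std::getline(in, banner);             // first line: format banner
  std::size_t rows = 0, cols = 0, nnz = 0;
  in >> rows >> cols >> nnz;            // second line: rows, columns, nonzeros
  std::cout << rows << " x " << cols << ", " << nnz << " nonzeros" << std::endl;
  return 0;
}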
The default compiler in Mac OS will support OpenMP, but it has to be invoked in a special way. The typical way (with g++, and with clang on Linux) to invoke the compiler with OpenMP is to add the flag -fopenmp. If you just use the flag that way with Mac OS clang, it will complain with an error message:
clang: error: unsupported option '-fopenmp'
To get clang to accept the option, you need to add the -Xpreprocessor
flag, e.g.,
c++ -Xpreprocessor -fopenmp -c hello_omp.cpp -o hello_omp.o
That will be enough to get your program to compile into a .o file, but to link you also need to specify the OpenMP library:
c++ -Xpreprocessor -fopenmp hello_omp.cpp -o hello_omp.exe -lomp
Since Mac OS does not readily support OpenMP compilation, it also doesn’t ship the necessary OpenMP runtime library. You can install it with brew (or your favorite package manager):
brew install libomp
Once you have done that, you should be able to compile your OpenMP programs. I recommend adding these compilation and linking flags to your Makefile (or making sure they are already there).
Another alternative for Mac OS X is to install g++ and use that. However, I don’t necessarily recommend having two compilers installed on your system.
I have tested compiling for OpenMP with Ubuntu 18.04 – native and under WSL. Both clang and g++ support the -fopenmp flag. If you get an error message at link time complaining about undefined symbols (where the symbols look OpenMP-related), then you will also need to install the OpenMP libraries. Under Ubuntu you can do this with:
sudo apt-get -y install libomp5
You may also want to check with apt search to see what versions of the OpenMP libraries are available to install, but version 5 worked for me (at least for a few tests).
OpenMP should already be installed on the AWS instance.
For command-line programs (like the ones we have been writing in this
course), the user of a program can pass parameters to it on the command
line and they are “automagically” put into the argv
array. This is
obviously a very useful way for a user of a program to state their
intentions to the program. Often however, the program itself needs to
find out information about the environment in which it is executing –
and this may be information that the user might not know or care about
or that would be inconvenient to pass on the command line every time a
program is run.
To enable a program to obtain information about its environment, most operating systems have a simple lookup mechanism that associates variable names with values (though these values are also just strings). One such variable that we have discussed previously in class is the status variable (i.e., the return value from a main() function).
From the bash command line, if we want to retrieve the value associated with an environment variable (that is, if we want to evaluate it), we prepend a $ sign to it. In bash the status variable is retrieved with $?, though in the csh shell it is retrieved with $status. As with program variables, evaluating a variable is not the same as printing it out. If you just type
$ $?
at the command line, you will get the cryptic response
0: Command not found
because the shell evaluated $?, which evaluated to 0. This in turn was evaluated by bash just as if we had typed in
$ 0
and since 0 is (probably) not an executable command, the shell prints its error message. If you want to examine the value of an environment variable in the shell, you need to pass that value to a program that will just print its argument back out. In most Unix-like environments, this command is echo:
$ echo $?
0
$ echo $SHELL
/bin/bash
Now, there are actually two overlapping namespaces in most command-line
shells: the environment variables and the shell variables. These
variables are accessed in the same way, with the $
prefix – but only
the environment variables are available to running programs. The shell
variables are used only by the shell program itself.
To see a list of all of the environment variables, issue the command:
$ env
Note that this is an executable program, not a built-in shell command, so it will retrieve and print the list of environment variables that are accessible to all programs. Try this yourself. You should see a couple dozen lines that look like this (I am just showing a few):
HOSTNAME=259e82e56952
SHELL=/bin/bash
TERM=xterm
HOME=/home/amath583
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/home/amath583
SHLVL=1
These variables in the environment are not special in any particular way. Again, the environment just maintains an association between variable names (e.g., HOME) and values (e.g., /home/amath583). There are conventions about how certain environment variables are used by different programs, and some programs will update the environment. For instance, if you query the value of PWD:
$ cd /
$ echo $PWD
/
$ cd
$ echo $PWD
/home/amath583
so the name of the current working directory is updated whenever you change directories.
Another important environment variable is the path (PATH). This contains a colon-separated list of directories. When you enter a command on the command line, the shell searches each of those directories for a program that matches the name you typed. If it finds a match, it will execute that program (using fork() and exec() as we have discussed in lecture). If it does not find a match, it will return an error message (and set the status to a non-zero value).
But how does the bash shell – which is just a program – get the value of the environment variables? The C++ standard library provides a function for looking up environment variables.
Disclaimer: As with main() when it is passed arguments, the function for querying the environment deals with character pointers. I will first explain the function itself and then give you a wrapper function that you can use so that you don’t have to actually deal with pointers.
The function for querying the environment is std::getenv(), prototyped as follows:
char* std::getenv(const char* str);
The function will look for a variable whose name matches the character
string passed in as str
.
NB: This function comes from the standard C library. In C a character string is a null-terminated sequence of characters. Character strings in C are passed by reference in some sense by passing the memory address of the first character in the string. Note that there is no specification about the size of the string. The string is demarcated by the beginning (the pointer) and the end (a special character) and can be of arbitrary length. This is the source of innumerable security exploits.
If there is a match between the string pointed to by str and a variable in the environment, getenv will return a pointer to the corresponding value in the environment. If a match is not found – and this is another one of the appalling characteristics of pointers – it will indicate an error, but it indicates that error by setting the value of the returned pointer to a special value, namely NULL (or zero). NULL pointers are no end of trouble and have probably set the software community back a generation.
And they are a problem (this is a general problem with pointers – but NULL is the most common case) because when a programmer has a pointer they want to de-reference it. If for some reason the pointer is NULL or some other invalid value, the program will crash with a segmentation violation and abort. Since pointers can be manipulated just like any other variable in C, it is easy to give one an invalid value.
Moreover, as with getenv, the special (and dangerous) value of NULL is often used to indicate an error condition for function calls that return pointers. If that value is not checked, and is then later used, the rare occasion on which that call fails will result in a segmentation violation.
If you are curious, try this program for yourself:
int main() {
  char* ptr = 0;   // a NULL pointer
  return *ptr;     // de-referencing it aborts with a segmentation violation
}
So that you can avoid dealing with pointers (and you should), I provide the following wrapper.
#include <iostream>
#include <cstdlib>
#include <string>

// Return the value of the environment variable named by in, or an empty
// string if the variable is not set.
std::string getenv_to_string(const char* in) {
  char* gotten = std::getenv(in);
  if (NULL == gotten) {
    return std::string("");
  } else {
    return std::string(gotten);
  }
}

std::string getenv(const std::string& in) {
  return getenv_to_string(in.c_str());
}
Just pass in a C string literal (pointer to char) or a std::string
and
it will return a string.
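For example (an illustrative sketch, assuming the wrapper above is compiled into the same file), you can read the PATH variable discussed earlier and print its colon-separated directories one per line:
#include <cstddef>
#include <iostream>
#include <string>

// Illustrative use of the getenv_to_string() wrapper defined above.
int main() {
  std::string path = getenv_to_string("PATH");
  std::size_t begin = 0;
  while (begin <= path.size()) {
    std::size_t end = path.find(':', begin);
    if (end == std::string::npos) end = path.size();
    std::cout << path.substr(begin, end - begin) << std::endl;   // one directory per line
    begin = end + 1;
  }
  return 0;
}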
Experiment with executing env
and examining some of the variables
individually. Create some new variables and examine their values. In a
separate shell window, you might try the following (as an example of
what to be careful about):
What do you get when you issue the command echo $PATH
?
What happens when you issue the command export PATH=""
?
What happens when you issue the command echo $PATH
following the
command above?
^D
may come in handy after the above.
As we have seen in lecture, OpenMP provides a compiler directive based
mechanism for parallelizing a job for shared-memory execution. One of
the principles upon which OpenMP is based is that the program source
code does not need to be changed – the compiler, as controlled by OpenMP directives (#pragma omp
statements), manages all concurrency. At the same time,
there are parameters about the running parallel program that we would
like the parallel program to be aware of. For example, we might want to
specify the number of threads that a program uses to execute.
Now, with the OpenMP philosophy, the program itself can be both
sequential and parallel – and it should largely be “unaware” of its
parallelization. (This should be true for any parallel program –
concerns should be well-separated.) Rather than using command-line
arguments as a way to pass information to OpenMP, OpenMP uses
environment variables. Environment variables also have the advantage
(compared to command-line arguments) that they can be accessed from any
arbitrary part of the program – including from a library. You can find a
list of environment variables that OpenMP uses on the OpenMP web site or
the OpenMP quick reference (http://bit.ly/2pCSigX). The most important
environment variable is OMP_NUM_THREADS
, which sets the (maximum)
number of threads that OpenMP will use at run-time. NB: It is
important to remember that environment variables don’t actually do
anything. Setting the value of OMP_NUM_THREADS
to some value doesn’t
automatically parallelize your program or cause OpenMP to have that
many threads. Rather it is a variable whose value any program can read –
it is up to the program what action to take (if any) in response to the
value it reads.
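For example, in bash (illustrative commands, in the same style as the shell examples above) you can set OMP_NUM_THREADS for a single run or export it for the rest of the session:
$ OMP_NUM_THREADS=8 ./a.out
$ export OMP_NUM_THREADS=8
$ ./a.out
Whether this has any effect depends entirely on whether the program – here, the OpenMP runtime – reads the variable.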
NB There may be an
incompatibility between clang and the cmath
include file – this problem
will only show up if you compile with the -Ofast
compilation flag in
addition to -fopenmp
. (The conflict has to do with recently
introduced SIMD commands in OpenMP 4.0.) If you get a page full of
errors for what should be a correct program, change -Ofast
to -O3
.
When a program is compiled with OpenMP, it can be run just as with any other program – execute it on the shell command line.
When running a parallel program, there are several pieces of information we would like to have. For example, in the last assignment, we partitioned our work into pieces and let a thread or task execute each piece (in parallel). For that assignment, we specified the number of threads / tasks to use so that we could experiment with different numbers of threads. In general, however, we don’t want the user to just arbitrarily set the number of threads that a job runs with. We might not want to run with more threads than there are cores available, for example. Or, we might want to set some resource limit on how many threads are used (which might be more or less than the number of cores). That is, we want to let the runtime system have some control over how our parallel program runs (and doing this automatically is in the spirit of OpenMP).
Consider the following program that inspects the OpenMP environment:
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>
#include <omp.h>

// Uses the getenv(std::string) wrapper presented earlier, assumed to be in
// the same file.
int main() {
  std::string envName = "OMP_NUM_THREADS";
  std::cout << envName << " = " << getenv(envName) << std::endl;
  std::cout << "hardware_concurrency() = " << std::thread::hardware_concurrency() << std::endl;
  std::cout << "omp_get_max_threads() = " << omp_get_max_threads() << std::endl;
  std::cout << "omp_get_num_threads() = " << omp_get_num_threads() << std::endl;
  return 0;
}
Note that we are using the getenv()
wrapper we presented earlier.
When I run this in my shell I get the following output (the numbers might vary depending on your hardware):
./a.out
OMP_NUM_THREADS =
hardware_concurrency() = 4
omp_get_max_threads() = 4
omp_get_num_threads() = 1
In this case, OMP_NUM_THREADS
is not set so OpenMP uses default
values. However, we have to be careful how we interpret this. The first
line shows that OMP_NUM_THREADS
is not set. The second line is
obtained with a call to the C++ standard library (outside of OpenMP) and
shows that the available hardware concurrency (number of cores) is equal
to four. The next two lines were obtained with calls to functions in the OpenMP
library. They show that the maximum available number of OpenMP threads
is equal to four. The final line, however, may seem curious – it says
that the number of threads is equal to one.
If we take a step back, though, a thread count of one makes sense in the context where it is queried. Recall from lecture that OpenMP is based on the fork-join model. Execution is sequential between OpenMP directives; the code blocks following those directives are run in parallel. In main(), where the call to omp_get_num_threads() is made, there is only one running thread – there are no other threads running concurrently with it. But how many threads actually run at one time with OpenMP? And how can we tell?
Let’s add a simple function to the above:
int omp_thread_count() {
  int n = 0;
#pragma omp parallel reduction(+:n)
  n += 1;
  return n;
}
This function will create the default number of threads and each one of
those will increment the variable n
by one – the final value of n
will be the number of threads that executed in parallel. If we add a
call to omp_thread_count()
to the above program we obtain as output:
./a.out
OMP_NUM_THREADS =
hardware_concurrency() = 4
omp_get_max_threads() = 4
omp_get_num_threads() = 1
omp_thread_count() = 4
And, indeed, the number of threads that were run in parallel is four.
Another approach would be to call omp_get_num_threads()
from inside a parallel scope:
int omp_thread_count() {
  int n = 0;
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    if (0 == tid) {
      n = omp_get_num_threads();
    }
  }
  return n;
}
One advantage as a programmer to using OpenMP is that parallelization can be done in line. When we parallelized programs explicitly using threads and tasks in C++, we had to bundle the work we wanted to be done into a separate function and launch a thread or task to execute that function. With OpenMP we can parallelize directly in line. For example, we can create a parallel “Hello World” as follows:
#include <iostream>
#include <thread>
#include <omp.h>

int main() {
  std::cout << "omp_get_max_threads() = " << omp_get_max_threads() << std::endl;
  std::cout << "omp_get_num_threads() = " << omp_get_num_threads() << std::endl;
#pragma omp parallel
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of "
              << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }
  return 0;
}
There is no helper function needed.
But, OpenMP is not a panacea. Underneath the covers, OpenMP runs threads. This is the output from one run of the above program:
$ ./a.out
omp_get_max_threads() = 4
omp_get_num_threads() = 1
Hello! I am thread Hello! I am thread 0 of 4
Hello! I am thread 2 of 4
My C++ std::thread id is 140325565663104
My C++ std::thread id is 140325592336192
3 of 4
My C++ std::thread id is 140325561464768
Hello! I am thread 1 of 4
My C++ std::thread id is 140325569861440
The first thing to notice is that we can get the thread id for the current thread with C++ standard library mechanisms. Problematically though, note that the strings that are being printed are interleaved with each other. We have seen this before in multithreading – when we had race conditions.
Just as with the explicit threading case, OpenMP can have race
conditions when shared data is being written. However, as with the
mechanisms for parallelization, the OpenMP mechanisms for protecting
data are also expressed as #pragma omp
directives. We can fix this
race by adding a #pragma omp critical
. Note that this is more general
than the mutex approach we had before. We are just telling OpenMP that
there is a race – not how to protect it.
int main() {
std::cout << "omp_get_max_threads() = " << omp_get_max_threads() << std::endl;
std::cout << "omp_get_num_threads() = " << omp_get_num_threads() << std::endl;
#pragma omp parallel
#pragma omp critical
{
std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " <<
omp_get_num_threads() << std::endl;
std::cout << "My C++ std::thread id is " << std::this_thread::get_id() <<
std::endl;
}
return 0;
}
The code for this example is contained in the file hello_omp.cpp
. In
this example, there are three commented lines with #pragma omp
directives.
// #pragma omp parallel
// #pragma omp parallel num_threads(2)
// #pragma omp critical
{
std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " <<
omp_get_num_threads() << std::endl;
std::cout << "My C++ std::thread id is " << std::this_thread::get_id() <<
std::endl;
}
Experiment with this program a little bit until you get a feel for how OpenMP manages threads and for how the OpenMP concepts of threads etc. map to what you already know about threads. You might try the following:
Uncomment the first line. Try different values of OMP_NUM_THREADS
.
You should see different numbers of “Hello” messages being printed.
What happens when you have more threads than hardware concurrency?
Are the thread IDs unique?
With the first line uncommented, see what the difference is between having the second line uncommented or not.
Comment out the first line and uncomment the second. How many
threads get run? What happens when you have different values of
OMP_NUM_THREADS
?
Finally, OpenMP is a little sneaky. Although it appears to operate in the form of preprocessor directives – and removing those preprocessor directives from the program will disable OpenMP – the directives can still interoperate with the program. To see what I mean by that, change the line #pragma omp parallel num_threads(2) to #pragma omp parallel num_threads(user_num), where user_num is an integer variable, i.e.,
size_t user_num = 2;   // default (matching the original num_threads(2)) if no argument is given
if (argc >= 2) user_num = std::stol(argv[1]);
#pragma omp parallel num_threads(user_num)
{
#pragma omp critical
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }
}
Does the program compile when using a program variable in the directive? (Hint: Yes.)
Run your program, passing in different (integer) values on the command line. Which takes precedence in determining how many threads are actually run: the variable or OMP_NUM_THREADS?
In Problem Set #5 we explored parallelizing some functions for computing the Euclidean norm of a vector. In this assignment we are going to look at how to use OpenMP to parallelize that computation. The code for this program is in the file norm_scale.cpp. There are five functions of interest: norm_sequential(), norm_blocked_reduction(), norm_blocked_critical(), norm_cyclic_reduction(), and norm_cyclic_critical(). The norm_scale.exe executable takes two arguments: the size of the vector to be normed and the number of trips to be taken when running. The reported time is for the entire set of norm computations for each algorithm, measured in milliseconds; the measured time is not divided by the number of trips.
The program will generate a scaling plot norm_scale.png
.
Using OpenMP,
parallelize the norm()
functions in the file norm_scale.cpp, as indicated below:
norm_block_reduction() – the function should be parallelized in block fashion according to the number of available OpenMP threads (as reported by your my_get_num_threads() function). The update to the shared variable sum should be protected using an OpenMP reduction directive.
norm_block_critical() – the function should be parallelized in block fashion according to the number of available OpenMP threads (as reported by your my_get_num_threads() function). The update to the shared variable sum should be protected using an OpenMP critical directive.
norm_cyclic_reduction() – the function should be parallelized in cyclic fashion according to the number of available OpenMP threads (as reported by your my_get_num_threads() function). The update to the shared variable sum should be protected using an OpenMP reduction directive.
norm_cyclic_critical() – the function should be parallelized in cyclic fashion according to the number of available OpenMP threads (as reported by your my_get_num_threads() function). The update to the shared variable sum should be protected using an OpenMP critical directive.
The bodies of these functions currently contain a for loop that computes the sequential two norm. You may have to do more to the function than simply inserting an OMP directive.
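For orientation only, here is a generic sketch of the blocked-reduction pattern applied to a plain std::vector<double>. This is not the assignment's Vector class and not a drop-in solution; the names and structure are illustrative:
#include <cmath>
#include <cstddef>
#include <vector>
#include <omp.h>

// Sketch: block-partition the index range across the OpenMP threads and
// combine the per-thread partial sums with a reduction.
double blocked_norm_sketch(const std::vector<double>& x) {
  double sum = 0.0;
#pragma omp parallel reduction(+:sum)
  {
    std::size_t nthreads  = omp_get_num_threads();
    std::size_t tid       = omp_get_thread_num();
    std::size_t blocksize = x.size() / nthreads;
    std::size_t begin     = tid * blocksize;
    std::size_t end       = (tid == nthreads - 1) ? x.size() : begin + blocksize;
    for (std::size_t i = begin; i < end; ++i) {
      sum += x[i] * x[i];
    }
  }
  return std::sqrt(sum);
}
The critical variants would instead accumulate a thread-local partial sum inside the parallel region and add it to the shared sum inside a #pragma omp critical block.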
Make sure that your norm()
functions compute the correct answer.
Note that there are also functions norm_seq()
and norm_parfor()
that are a sequential and an OMP parallel for
version of two_norm.
When your program is compiled and running, the verify.bash script will invoke it in several ways, producing the plots norm_scale_small.png and norm_scale_large.png.
The plots that were generated each show the program’s scalability with respect to number of cores for small, medium, and large problem sizes. Which combination of problem size, blocking, and synchronization shows the best performance? Which shows the worst? Explain, using what you know about threads, locking, OpenMP, computer architecture, etc.
For this assignment, we are going to investigate OpenMP parallelization of the sparse matrix-vector products we have previously developed. Each of the sparse matrix formats (CSRMatrix, CSCMatrix, COOMatrix, AOSMatrix) in your repo has an added function: omp_matvec(). Each of these is currently written in its sequential form.
The driver program for this problem is matvec_scale.cpp, which generates a sparse matrix corresponding to a discretized 2D Laplacian (for various problem sizes).
The verify.bash script will invoke it for small, medium, and large sizes. As you complete parallelization of the different matrix types, you need to uncomment the appropriate preprocessor directive in your Makefile. The Makefile comes with TIME_CSR
uncommented.
Parallelize the omp_matvec()
function in CSRMatrix.hpp using omp parallel for
.
Parallelize the omp_matvec()
function in COOMatrix.hpp using omp parallel for
.
(AMATH 583 Only)
Parallelize the omp_matvec()
function in CSCMatrix.hpp using omp parallel for
.
(AMATH 583 Only)
Parallelize the omp_matvec()
function in AOSMatrix.hpp using omp parallel for
.
For each of these implementations, make sure that your implementation is correct and that it is safe (no race conditions).
Think carefully about what is happening – remember that OpenMP is still threads.
Which matrix formats (if any) need to be protected against race conditions?
What approach did you use to ensure safety? Did that approach affect performance?
(Extra Credit)
If there is a potential for a race condition in any of the sparse matrix formats, can you create some combination of number of threads, problem size, matrix data, etc to expose it?
The plots that were generated each show the scalability with respect to number of cores for small, medium, and large problem sizes. Which combination of matrix format, problem size, blocking, and synchronization shows the best performance? Which shows the worst? Explain, using what you know about threads, locking, OpenMP, computer architecture, etc.
In the last assignment we started looking at an implementation of the PageRank algorithm, the primary computation of which is a sparse matrix-vector product.
In class and in our assignments up to now, we have taken the immediately obvious approach to partitioning work (or data) for parallelization: we divided the loop indices (or the range of Vector indices) into equal sizes and gave each piece of work to a thread.
There is an underlying assumption here, however. To get maximum parallel speedup, each thread needs to start and finish at the same time – meaning, each thread needs to have the same amount of work to do. In partitioning our work by loop index or by Vector index, we are assuming that these partitionings will result in equal amounts of work for each thread.
This assumption will hold, for example, if we are adding two vectors together (or performing numerical quadrature over a fixed interval). But, suppose we are doing a matrix vector product and partition the problem such that each thread gets an equal number of rows to process in the matvec. The amount of work per thread depends now on the number of nonzeros in the portion of the matrix it is processing. If the matrix has (approximately) an equivalent number of nonzeros per row, each thread will get (approximately) the same amount of work. In scientific computing where the matrix is derived from a discretized PDE, each row will have the same number of elements since the discretized differential operator (which gives rise to the nonzero pattern in the sparse matrix) is uniformly applied everywhere.
However, discretized PDEs aren’t the only source of sparse matrices. Consider the PageRank algorithm. Each incoming link to a page creates a non-zero in the corresponding row of the sparse matrix used to represent the page-link graph. The number of links per page in the web (or a social network, say) is highly non-uniform. Some pages have many links to them, others very few. A well-known celebrity or politician may have millions or tens of millions of followers. On the other hand, almost all of those millions of followers will have many many fewer followers of their own.
If we partition a sparse matrix derived from a social network based simply on assigning an equal number of rows from the sparse matrix, we can easily end up in the situation where one thread is processing orders of magnitude more work than another, with the resultant loss in parallel speedup.
When we are partitioning a loop or a data structure ourselves, we can also use what we know about the problem and develop an appropriate partitioning strategy. For example, we can partition a sparse matrix so that there are approximately the same number of nonzeros per partition.
The CSRMatrix class has a parallel matrix-vector product function par_matvec()
that
launches tasks to work on parts of the problem (rows of the matrix) as given in the
partitions_
array. The basic approach for partitioning the problem by rows is
contained in the partition_by_row()
function. This function divides the number of
rows in the problem by the number of partitions and assigns that size of problem to
each task. There is a bit of fix up with the remainder in case the problem size
doesn’t divide by the number of partitions evenly. The partial_sum()
call then
creates a running sum in the array, turning the sizes into the beginning row and
ending row of each partition.
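To make the description above concrete, a row-based partitioning like the one just described might be sketched as follows (illustrative only; the actual partition_by_row() in your repo may differ in detail):
#include <cstddef>
#include <numeric>   // std::partial_sum
#include <vector>

// Sketch of the idea: store per-partition row counts, then turn them into
// cumulative begin/end row offsets with partial_sum.
std::vector<std::size_t> partition_by_row_sketch(std::size_t num_rows, std::size_t parts) {
  std::vector<std::size_t> bounds(parts + 1, 0);
  std::size_t size = num_rows / parts, remainder = num_rows % parts;
  for (std::size_t p = 0; p < parts; ++p) {
    bounds[p + 1] = size + (p < remainder ? 1 : 0);   // rows assigned to partition p
  }
  std::partial_sum(bounds.begin(), bounds.end(), bounds.begin());
  return bounds;   // partition p covers rows [bounds[p], bounds[p+1])
}
The partial_sum() call is what turns the per-partition sizes into the running begin/end offsets described above.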
For this problem, fill in the function partition_by_nnz(). This should fill in the partition_ array so that there are approximately the same number of nonzero elements to be processed in each partition.
To attack this problem, think about the information stored in the arrays
you already have in the CSR matrix format. Is the information about the number of
nonzeros in each row already contained somewhere?
Each of your partitions should have a
begin and end such that the number of nonzeros (the number of elements) is
approximately the same for each partition.
The driver program for this problem is csr_scale.cpp. It takes as input a sparse matrix file.
You can try it with the “laplacian” files, the “graph500” files, or the “cit-Patents” file.
The parallel matrix-vector product depends only on the partitioning, so all you have to do to run it is call the appropriate partition member function. You don’t need to change par_matvec at all.
There is a utility program in your directory that you can build and run on any of those files to display the degree distribution of the matrices they contain.
When your program is working correctly, you should see a difference in performance between the partition_by_row and partition_by_nnz partitionings.
Fortunately, OpenMP also provides some mechanisms for load balancing, namely the schedule option to parallel for. We can use that option as part of a pragma:
#pragma omp parallel for schedule(static)
Alternatively, we can make a library call
omp_set_schedule(omp_sched_static, 0);
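For example (an illustrative sketch, not taken from csr_omp.cpp), the schedule(runtime) clause is what makes a loop honor whatever omp_set_schedule() – or the OMP_SCHEDULE environment variable – has established:
#include <cstddef>
#include <vector>
#include <omp.h>

// Sketch: choose the schedule at run time, then run a loop that defers to it.
double scheduled_sum(const std::vector<double>& x, int chunk) {
  omp_set_schedule(omp_sched_dynamic, chunk);   // e.g. dynamic scheduling with the given chunk size
  double sum = 0.0;
#pragma omp parallel for schedule(runtime) reduction(+:sum)
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum += x[i];
  }
  return sum;
}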
The driver program for this problem is csr_omp.cpp, which uses the library calls to set the type of scheduling.
Since you have already
parallelized CSRMatrix
with OpenMP, you just need to compile and run the driver.
The program takes as an option a matrix file as well as a block size to give as a hint
to omp_set_schedule()
. Experiment with different block sizes, particularly with the
cit-Patents matrix.
Which scheduling approach (and which block size) gives the best load-balancing performance?
Finally, we come back to the PageRank algorithm. From the previous problem you have two functions to partition the CSRMatrix and perform a task-based parallel matrix-vector product. You also have an OpenMP-parallelized version of the matrix-vector product.
Parallelize pagerank to the greatest extent that you can (to get the best scalability).
The key set of operations is in the statement
Vector y = alpha * (P * x) + (1.0 - alpha) / x.num_rows();
What do you need to change to effect parallelization of the matrix-vector product portion of the pagerank computation? What parallelization approach did you choose?
(Extra Credit) There are other operations of the expression above that can be parallelized. Use a parallelization approach of your choosing and report on the benefits obtained.
(Extra Credit) Are there any operations in the program where Amdahl’s Law would limit scalability?
Do a make clean in your ps6 directory and then create a tar ball ps6-submit.tar.gz containing all of the files in that directory. From the directory containing ps6:
$ cd ps6
$ make clean
$ cd ..
$ tar zcvf ps6-submit.tar.gz --exclude "./ps6/.vscode" ./ps6
If you relied on any external references, list them in a file refs.txt and include that in your tarball as well (in the ps6 subdirectory).