University of Washington
Department of Applied Mathematics
AMATH 483/583
Problem Set #6
Assigned: May 23, 2019
Due: May 30, 2019
As with previous assignments, this one extracts into a directory ps6. In that directory is a file Questions.md with short-answer questions that you will be asked in different parts of this assignment. To fill in your answers, open the file in vscode and insert your answers directly into the file, using the format indicated.
The tar ball for this assignment can be downloaded with this link.
Scaling tests for the last three assignments in the course ( PS6 , PS7 , and the Final ) will be done using a server instance on AWS. Instructions for accessing AWS can be found here.
These machines will have substantially more cores (16 to 32) than your laptop and can therefore support better scaling studies. Note however that these machines are shared resources. For PS6 and PS7 you should be able to complete the assignments on a shared instance. But, be considerate of your fellow students and try to limit the size and duration of your runs while you are testing and debugging.
NB: If you have access to another machine with 16 or so cores, you may use that instead of AWS.
On the AWS instance there is a public directory “/nfs/home/public” where I will put the assignments as well as auxiliary data for assignments. Some of the data files we will be using are fairly large. I recommend not copying them to your home directory, but rather reading them directly from the public directory.
For example, if you were running the program scale.exe and wanted to use the file cit-Patents.mmio from the public directory, you could run
$ ./scale.exe -i /nfs/home/public/ps6/data/cit-Patents.mmio
assuming the above path is where the file lives and that scale.exe takes the -i option. The point is to access large files directly from the public directory rather than copying them.
If you do want to just use a local name for the file rather than the full path, you can create a symbolic link:
$ ln -s /nfs/home/public/ps6/data/cit-Patents.mmio .
A symbolic link is a reference to the actual file rather than a copy, and can be accessed and treated as if it were the actual file.
$ ./scale.exe -i cit-Patents.mmio
Besides some specific files, the public data directory contains two sets of files; the files within each set have the same structure but different scales. The “Laplace” files are sparse matrices from a discretized 2D Laplace’s equation, and the “graph500” files are synthetically generated sparse matrices that supposedly mimic social network behavior.
The files are in binary format so that they can be loaded into your programs much more quickly than text-based files. If you want to see a little bit of information about each file (rows, columns, number of nonzeros), the first two lines are in text:
$ head -2 cit-Patents.mmio
The numbers in the second line are, respectively, the number of rows, the number of columns, and the number of nonzeros.
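If you would rather read those numbers from inside a program, here is a minimal sketch (a hypothetical helper, not part of the assignment code) that parses just the two text header lines:
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical sketch: print the dimensions recorded in the two text header
// lines of an .mmio file.  Use the assignment's own I/O routines to actually
// load the matrices.
int main(int argc, char* argv[]) {
  if (argc < 2) return 1;
  std::ifstream in(argv[1]);
  std::string banner;
  std::getline(in, banner);             // first line: format banner
  std::size_t rows = 0, cols = 0, nnz = 0;
  in >> rows >> cols >> nnz;            // second line: rows, columns, nonzeros
  std::cout << rows << " x " << cols << ", " << nnz << " nonzeros" << std::endl;
  return 0;
}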
The default compiler in Mac OS will support OpenMP, but it has to be invoked in a special way. The typical way (with g++, and with clang on Linux) to invoke the compiler with OpenMP is to add the flag -fopenmp. If you just use the flag that way with Mac OS clang, it will complain with an error message:
clang: error: unsupported option '-fopenmp'
To get clang to accept the option, you need to add the -Xpreprocessor
flag, e.g.,
c++ -Xpreprocessor -fopenmp -c hello_omp.cpp -o hello_omp.o
That will be enough to get your program to compile into a .o file, but to link you also need to specify the OpenMP library:
c++ -Xpreprocessor -fopenmp hello_omp.cpp -o hello_omp.exe -lomp
Since Mac OS does not readily support OpenMP compilation, it also doesn’t ship the necessary OpenMP runtime library. You can install it with brew (or your favorite package manager):
brew install libomp
Once you have done that, you should be able to compile your OpenMP programs. I recommend adding these compilation and linking flags to your Makefile (or making sure they are already there).
Another alternative for Mac OS X is to install g++ and use that. However, I don’t necessarily recommend having two compilers installed on your system.
I have tested compiling for OpenMP with Ubuntu 18.04 – native and under WSL. Both clang and g++ support the -fopenmp flag. If you get an error message at link time complaining about undefined symbols (where the symbols look OpenMP-related), then you will also need to install the OpenMP libraries. Under Ubuntu you can do this with:
sudo apt-get -y install libomp5
You may also want to check with apt search to see what versions of the OpenMP libraries are available to install, but version 5 worked for me (at least for a few tests).
OpenMP should already be installed on the AWS instance.
For command-line programs (like the ones we have been writing in this
course), the user of a program can pass parameters to it on the command
line and they are “automagically” put into the argv
array. This is
obviously a very useful way for a user of a program to state their
intentions to the program. Often however, the program itself needs to
find out information about the environment in which it is executing –
and this may be information that the user might not know or care about
or that would be inconvenient to pass on the command line every time a
program is run.
To enable a program to obtain information about its environment, most operating systems have a simple lookup mechanism that associates variable names with values (though these values are also just strings). One such variable that we have discussed previously in class is the status variable (i.e., the return value from a main() function).
From the bash command line, if we want to retrieve the value associated with an environment variable (that is, if we want to evaluate it), we prepend a $ sign to it. In bash the status variable is retrieved with $?, though in the csh shell it is retrieved with $status. As with program variables, evaluating a variable is not the same as printing it out. If you just type
$ $?
at the command line, you will get the cryptic response
0: Command not found
because the shell evaluated $?, which evaluated to 0. This in turn was evaluated by bash just as if we had typed in
$ 0
and since 0 is (probably) not an executable command, the shell prints its error message. If you want to examine the value of an environment variable in the shell, you need to pass that value to a program that will just print its argument back out. In most Unix-like environments, this command is echo:
$ echo $?
0
$ echo $SHELL
/bin/bash
Now, there are actually two overlapping namespaces in most command-line
shells: the environment variables and the shell variables. These
variables are accessed in the same way, with the $
prefix – but only
the environment variables are available to running programs. The shell
variables are used only by the shell program itself.
To see a list of all of the environment variables, issue the command:
$ env
Note that this is an executable program, not a built-in shell command, so it will retrieve and print the list of environment variables that are accessible to all programs. Try this yourself. You should see a couple dozen lines that look like this (I am just showing a few):
HOSTNAME=259e82e56952
SHELL=/bin/bash
TERM=xterm
HOME=/home/amath583
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/home/amath583
SHLVL=1
These variables in the environment are not special in any particular way. Again, the environment just maintains an association between variable names (e.g., HOME) and values (e.g., /home/amath583). There are conventions about how certain environment variables are used by different programs, and some programs will update the environment. For instance, if you query the value of PWD:
$ cd /
$ echo $PWD
/
$ cd
$ echo $PWD
/home/amath583
so the name of the current working directory is updated whenever you change directories.
Another important environment variable is the path (PATH). This contains a colon-separated list of directories. When you enter a command on the command line, the shell searches each of those directories for a program that matches the name you typed. If it finds a match, it will execute that program (using fork() and exec() as we have discussed in lecture). If it does not find a match, it will return an error message (and set the status to a non-zero value).
But how does the bash shell – which is just a program – get the value of the environment variables? The C++ standard library provides a function for looking up environment variables.
Disclaimer: As with main() when it is passed arguments, the function for querying the environment deals with character pointers. I will first explain the function itself and then give you a wrapper function that you can use so that you don’t have to actually deal with pointers.
The function for querying the environment is std::getenv(), prototyped as follows:
char* std::getenv(const char* str);
The function will look for a variable whose name matches the character
string passed in as str
.
NB: This function comes from the standard C library. In C a character string is a null-terminated sequence of characters. Character strings in C are passed by reference in some sense by passing the memory address of the first character in the string. Note that there is no specification about the size of the string. The string is demarcated by the beginning (the pointer) and the end (a special character) and can be of arbitrary length. This is the source of innumerable security exploits.
If there is a match between the string pointed to by str and a variable in the environment, getenv will return a pointer to the corresponding value in the environment. If a match is not found – and this is another one of the appalling characteristics of pointers – it will indicate an error, but it indicates that error by setting the value of the returned pointer to a special value, namely NULL (or zero). NULL pointers are no end of trouble and have probably set the software community back a generation.
And they are a problem (this is a general problem with pointers – but NULL is the most common case) because when a programmer has a pointer they want to de-reference it. If for some reason the pointer is NULL or some other invalid value, the program will crash with a segmentation violation and abort. Since pointers can be manipulated just like any other variable in C, it is easy to give one an invalid value.
Moreover, as with getenv, the special (and dangerous) value of NULL is often used to indicate an error condition for function calls that return pointers. If that value is not checked, and is then later used, the rare occasion on which that call fails will result in a segmentation violation.
If you are curious, try this program for yourself:
int main() {
  char* ptr = 0;   // a NULL pointer
  return *ptr;     // de-referencing it aborts with a segmentation violation
}
So that you can avoid dealing with pointers (and you should), I provide the following wrapper.
#include <iostream>
#include <cstdlib>
#include <string>

// Return the value of the environment variable named by in, or an empty
// string if the variable is not set.
std::string getenv_to_string(const char* in) {
  char* gotten = std::getenv(in);
  if (NULL == gotten) {
    return std::string("");
  } else {
    return std::string(gotten);
  }
}

std::string getenv(const std::string& in) {
  return getenv_to_string(in.c_str());
}
Just pass in a C string literal (pointer to char) or a std::string
and
it will return a string.
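For example (an illustrative sketch, assuming the wrapper above is compiled into the same file), you can read the PATH variable discussed earlier and print its colon-separated directories one per line:
#include <cstddef>
#include <iostream>
#include <string>

// Illustrative use of the getenv_to_string() wrapper defined above.
int main() {
  std::string path = getenv_to_string("PATH");
  std::size_t begin = 0;
  while (begin <= path.size()) {
    std::size_t end = path.find(':', begin);
    if (end == std::string::npos) end = path.size();
    std::cout << path.substr(begin, end - begin) << std::endl;   // one directory per line
    begin = end + 1;
  }
  return 0;
}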
Experiment with executing env
and examining some of the variables
individually. Create some new variables and examine their values. In a
separate shell window, you might try the following (as an example of
what to be careful about):
What do you get when you issue the command echo $PATH
?
What happens when you issue the command export PATH=""
?
What happens when you issue the command echo $PATH
following the
command above?
^D
may come in handy after the above.
As we have seen in lecture, OpenMP provides a compiler directive based
mechanism for parallelizing a job for shared-memory execution. One of
the principles upon which OpenMP is based is that the program source
code does not need to be changed – the compiler, as controlled by OpenMP directives (#pragma omp
statements), manages all concurrency. At the same time,
there are parameters about the running parallel program that we would
like the parallel program to be aware of. For example, we might want to
specify the number of threads that a program uses to execute.
Now, with the OpenMP philosophy, the program itself can be both
sequential and parallel – and it should largely be “unaware” of its
parallelization. (This should be true for any parallel program –
concerns should be well-separated.) Rather than using command-line
arguments as a way to pass information to OpenMP, OpenMP uses
environment variables. Environment variables also have the advantage
(compared to command-line arguments) that they can be accessed from any
arbitrary part of the program – including from a library. You can find a
list of environment variables that OpenMP uses on the OpenMP web site or
the OpenMP quick reference (http://bit.ly/2pCSigX). The most important
environment variable is OMP_NUM_THREADS
, which sets the (maximum)
number of threads that OpenMP will use at run-time. NB: It is
important to remember that environment variables don’t actually do
anything. Setting the value of OMP_NUM_THREADS
to some value doesn’t
automatically parallelize your program or cause OpenMP to have that
many threads. Rather it is a variable whose value any program can read –
it is up to the program what action to take (if any) in response to the
value it reads.
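For example, in bash (illustrative commands, in the same style as the shell examples above) you can set OMP_NUM_THREADS for a single run or export it for the rest of the session:
$ OMP_NUM_THREADS=8 ./a.out
$ export OMP_NUM_THREADS=8
$ ./a.out
Whether this has any effect depends entirely on whether the program – here, the OpenMP runtime – reads the variable.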
NB There may be an
incompatibility between clang and the cmath
include file – this problem
will only show up if you compile with the -Ofast
compilation flag in
addition to -fopenmp
. (The conflict has to do with recently
introduced SIMD commands in OpenMP 4.0.) If you get a page full of
errors for what should be a correct program, change -Ofast
to -O3
.
When a program is compiled with OpenMP, it can be run just as with any other program – execute it on the shell command line.
When running a parallel program, there are several pieces of information we would like to have. For example, in the last assignment, we partitioned our work into pieces and let a thread or task execute each piece (in parallel). For that assignment, we specified the number of threads / tasks to use so that we could experiment with different numbers of threads. In general, however, we don’t want the user to just arbitrarily set the number of threads that a job runs with. We might not want to run with more threads than there are cores available, for example. Or, we might want to set some resource limit on how many threads are used (which might be more or less than the number of cores). That is, we want to let the runtime system have some control over how our parallel program runs (and doing this automatically is in the spirit of OpenMP).
Consider the following program that inspects the OpenMP environment:
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>
#include <omp.h>

// Uses the getenv(std::string) wrapper presented earlier, assumed to be in
// the same file.
int main() {
  std::string envName = "OMP_NUM_THREADS";
  std::cout << envName << " = " << getenv(envName) << std::endl;
  std::cout << "hardware_concurrency() = " << std::thread::hardware_concurrency() << std::endl;
  std::cout << "omp_get_max_threads() = " << omp_get_max_threads() << std::endl;
  std::cout << "omp_get_num_threads() = " << omp_get_num_threads() << std::endl;
  return 0;
}
Note that we are using the getenv()
wrapper we presented earlier.
When I run this in my shell I get the following output (the numbers might vary depending on your hardware):
./a.out
OMP_NUM_THREADS =
hardware_concurrency() = 4
omp_get_max_threads() = 4
omp_get_num_threads() = 1
In this case, OMP_NUM_THREADS
is not set so OpenMP uses default
values. However, we have to be careful how we interpret this. The first
line shows that OMP_NUM_THREADS
is not set. The second line is
obtained with a call to the C++ standard library (outside of OpenMP) and
shows that the available hardware concurrency (number of cores) is equal
to four. The next two lines were obtained with calls to functions in the OpenMP
library. They show that the maximum available number of OpenMP threads
is equal to four. The final line, however, may seem curious – it says
that the number of threads is equal to one.
If we take a step back, though, a thread count of one makes sense in the context where it is queried. Recall from lecture that OpenMP is based on the fork-join model. Execution is sequential between OpenMP directives; the code blocks following those directives are run in parallel. In main(), where the call to omp_get_num_threads() is made, there is only one running thread – there are no other threads running concurrently with it. But how many threads actually run at one time with OpenMP? And how can we tell?
Let’s add a simple function to the above:
int omp_thread_count() {
  int n = 0;
#pragma omp parallel reduction(+:n)
  n += 1;
  return n;
}
This function will create the default number of threads and each one of
those will increment the variable n
by one – the final value of n
will be the number of threads that executed in parallel. If we add a
call to omp_thread_count()
to the above program we obtain as output:
./a.out
OMP_NUM_THREADS =
hardware_concurrency() = 4
omp_get_max_threads() = 4
omp_get_num_threads() = 1
omp_thread_count() = 4
And, indeed, the number of threads that were run in parallel is four.
Another approach would be to call omp_get_num_threads()
from inside a parallel scope:
int omp_thread_count() {
  int n = 0;
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    if (0 == tid) {
      n = omp_get_num_threads();
    }
  }
  return n;
}
One advantage as a programmer to using OpenMP is that parallelization can be done in line. When we parallelized programs explicitly using threads and tasks in C++, we had to bundle the work we wanted to be done into a separate function and launch a thread or task to execute that function. With OpenMP we can parallelize directly in line. For example, we can create a parallel “Hello World” as follows:
#include <iostream>
#include <thread>
#include <omp.h>

int main() {
  std::cout << "omp_get_max_threads() = " << omp_get_max_threads() << std::endl;
  std::cout << "omp_get_num_threads() = " << omp_get_num_threads() << std::endl;
#pragma omp parallel
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of "
              << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }
  return 0;
}
There is no helper function needed.
But, OpenMP is not a panacea. Underneath the covers, OpenMP runs threads. This is the output from one run of the above program:
$ ./a.out
omp_get_max_threads() = 4
omp_get_num_threads() = 1
Hello! I am thread Hello! I am thread 0 of 4
Hello! I am thread 2 of 4
My C++ std::thread id is 140325565663104
My C++ std::thread id is 140325592336192
3 of 4
My C++ std::thread id is 140325561464768
Hello! I am thread 1 of 4
My C++ std::thread id is 140325569861440
The first thing to notice is that we can get the thread id for the current thread with C++ standard library mechanisms. Problematically though, note that the strings that are being printed are interleaved with each other. We have seen this before in multithreading – when we had race conditions.
Just as with the explicit threading case, OpenMP can have race
conditions when shared data is being written. However, as with the
mechanisms for parallelization, the OpenMP mechanisms for protecting
data are also expressed as #pragma omp
directives. We can fix this
race by adding a #pragma omp critical
. Note that this is more general
than the mutex approach we had before. We are just telling OpenMP that
there is a race – not how to protect it.
int main() {
std::cout << "omp_get_max_threads() = " << omp_get_max_threads() << std::endl;
std::cout << "omp_get_num_threads() = " << omp_get_num_threads() << std::endl;
#pragma omp parallel
#pragma omp critical
{
std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " <<
omp_get_num_threads() << std::endl;
std::cout << "My C++ std::thread id is " << std::this_thread::get_id() <<
std::endl;
}
return 0;
}
The code for this example is contained in the file hello_omp.cpp
. In
this example, there are three commented lines with #pragma omp
directives.
// #pragma omp parallel
// #pragma omp parallel num_threads(2)
// #pragma omp critical
{
std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " <<
omp_get_num_threads() << std::endl;
std::cout << "My C++ std::thread id is " << std::this_thread::get_id() <<
std::endl;
}
Experiment with this program a little bit until you get a feel for how OpenMP manages threads and for how the OpenMP concepts of threads etc. map to what you already know about threads. You might try the following:
Uncomment the first line. Try different values of OMP_NUM_THREADS
.
You should see different numbers of “Hello” messages being printed.
What happens when you have more threads than hardware concurrency?
Are the thread IDs unique?
With the first line uncommented, see what the difference is between having the second line uncommented or not.
Comment out the first line and uncomment the second. How many
threads get run? What happens when you have different values of
OMP_NUM_THREADS
?
Finally, OpenMP is a little sneaky. Although it appears to operate in the form of preprocessor directives – and removing those preprocessor directives from the program will disable OpenMP – the directives can still interoperate with the program. To see what I mean by that, change the line #pragma omp parallel num_threads(2) to #pragma omp parallel num_threads(user_num), where user_num is an integer variable, i.e.,
size_t user_num = 2;   // default (matching the original num_threads(2)) if no argument is given
if (argc >= 2) user_num = std::stol(argv[1]);
#pragma omp parallel num_threads(user_num)
{
#pragma omp critical
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }
}
Does the program compile when using a program variable in the directive? (Hint: Yes.)
Run your program, passing in different (integer) values on the command line. Which takes precedence in determining how many threads are actually run: the variable or OMP_NUM_THREADS?
In Problem Set #5 we explored parallelizing some functions for computing the Euclidean norm of a vector. In this assignment we are going to look at how to use OpenMP to parallelize that computation. The code for this program is in the file norm_scale.cpp. There are five functions of interest: norm_sequential(), norm_blocked_reduction(), norm_blocked_critical(), norm_cyclic_reduction(), and norm_cyclic_critical(). The norm_scale.exe executable takes two arguments: the size of the vector to be normed and the number of trips to be taken when running. The reported time is for the entire set of norm computations for each algorithm, measured in milliseconds; the measured time is not divided by the number of trips.
The program will generate a scaling plot norm_scale.png
.
Using OpenMP,
parallelize the norm()
functions in the file norm_scale.cpp, as indicated below:
norm_block_reduction() – the function should be parallelized in block fashion according to the number of available OpenMP threads (as reported by your my_get_num_threads() function). The update to the shared variable sum should be protected using an OpenMP reduction directive.
norm_block_critical() – the function should be parallelized in block fashion according to the number of available OpenMP threads (as reported by your my_get_num_threads() function). The update to the shared variable sum should be protected using an OpenMP critical directive.
norm_cyclic_reduction() – the function should be parallelized in cyclic fashion according to the number of available OpenMP threads (as reported by your my_get_num_threads() function). The update to the shared variable sum should be protected using an OpenMP reduction directive.
norm_cyclic_critical() – the function should be parallelized in cyclic fashion according to the number of available OpenMP threads (as reported by your my_get_num_threads() function). The update to the shared variable sum should be protected using an OpenMP critical directive.
The bodies of these functions currently contain a for loop that computes the sequential two norm. You may have to do more to the function than simply inserting an OMP directive.
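For orientation only, here is a generic sketch of the blocked-reduction pattern applied to a plain std::vector<double>. This is not the assignment's Vector class and not a drop-in solution; the names and structure are illustrative:
#include <cmath>
#include <cstddef>
#include <vector>
#include <omp.h>

// Sketch: block-partition the index range across the OpenMP threads and
// combine the per-thread partial sums with a reduction.
double blocked_norm_sketch(const std::vector<double>& x) {
  double sum = 0.0;
#pragma omp parallel reduction(+:sum)
  {
    std::size_t nthreads  = omp_get_num_threads();
    std::size_t tid       = omp_get_thread_num();
    std::size_t blocksize = x.size() / nthreads;
    std::size_t begin     = tid * blocksize;
    std::size_t end       = (tid == nthreads - 1) ? x.size() : begin + blocksize;
    for (std::size_t i = begin; i < end; ++i) {
      sum += x[i] * x[i];
    }
  }
  return std::sqrt(sum);
}
The critical variants would instead accumulate a thread-local partial sum inside the parallel region and add it to the shared sum inside a #pragma omp critical block.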
Make sure that your norm()
functions compute the correct answer.
Note that there are also functions norm_seq()
and norm_parfor()
that are a sequential and an OMP parallel for
version of two_norm.
When your program is compiled and running, the verify.bash script will invoke it in several ways, producing the plots norm_scale_small.png and norm_scale_large.png.
The plots that were generated each show the program’s scalability with respect to number of cores for small, medium, and large problem sizes. Which combination of problem size, blocking, and synchronization shows the best performance? Which shows the worst? Explain, using what you know about threads, locking, OpenMP, computer architecture, etc.
For this assignment, we are going to investigate OpenMP parallelization of the sparse matrix-vector products we have previously developed. Each of the sparse matrix formats (CSRMatrix, CSCMatrix, COOMatrix, AOSMatrix) in your repo has an added function: omp_matvec(). Each of these is currently written in its sequential form.
The driver program for this problem is matvec_scale.cpp, which generates a sparse matrix corresponding to a discretized 2D Laplacian (for various problem sizes).
The verify.bash script will invoke it for small, medium, and large sizes. As you complete parallelization of the different matrix types, you need to uncomment the appropriate preprocessor directive in your Makefile. The Makefile comes with TIME_CSR
uncommented.
Parallelize the omp_matvec()
function in CSRMatrix.hpp using omp parallel for
.
Parallelize the omp_matvec()
function in COOMatrix.hpp using omp parallel for
.
(AMATH 583 Only)
Parallelize the omp_matvec()
function in CSCMatrix.hpp using omp parallel for
.
(AMATH 583 Only)
Parallelize the omp_matvec()
function in AOSMatrix.hpp using omp parallel for
.
For each of these implementations, make sure that your implementation is correct and that it is safe (no race conditions).
Think carefully about what is happening – remember that OpenMP is still threads.
Which matrix formats (if any) need to be protected against race conditions?
What approach did you use to ensure safety? Did that approach affect performance?
(Extra Credit)
If there is a potential for a race condition in any of the sparse matrix formats, can you create some combination of number of threads, problem size, matrix data, etc to expose it?
The plots that were generated each show the scalability with respect to number of cores for small, medium, and large problem sizes. Which combination of matrix format, problem size, blocking, and synchronization shows the best performance? Which shows the worst? Explain, using what you know about threads, locking, OpenMP, computer architecture, etc.
In the last assignment we started looking at an implementation of the PageRank algorithm, the primary computation of which is a sparse matrix-vector product.
In class and in our assignments up to now, we have taken the immediately obvious approach to partitioning work (or data) for parallelization: we divided the loop indices (or the range of Vector indices) into equal sizes and gave each piece of work to a thread.
There is an underlying assumption here, however. To get maximum parallel speedup, each thread needs to start and finish at the same time – meaning, each thread needs to have the same amount of work to do. In partitioning our work by loop index or by Vector index, we are assuming that these partitionings will result in equal amounts of work for each thread.
This assumption will hold, for example, if we are adding two vectors together (or performing numerical quadrature over a fixed interval). But, suppose we are doing a matrix vector product and partition the problem such that each thread gets an equal number of rows to process in the matvec. The amount of work per thread depends now on the number of nonzeros in the portion of the matrix it is processing. If the matrix has (approximately) an equivalent number of nonzeros per row, each thread will get (approximately) the same amount of work. In scientific computing where the matrix is derived from a discretized PDE, each row will have the same number of elements since the discretized differential operator (which gives rise to the nonzero pattern in the sparse matrix) is uniformly applied everywhere.
However, discretized PDEs aren’t the only source of sparse matrices. Consider the PageRank algorithm. Each incoming link to a page creates a non-zero in the corresponding row of the sparse matrix used to represent the page-link graph. The number of links per page in the web (or a social network, say) is highly non-uniform. Some pages have many links to them, others very few. A well-known celebrity or politician may have millions or tens of millions of followers. On the other hand, almost all of those millions of followers will have many many fewer followers of their own.
If we partition a sparse matrix derived from a social network based simply on assigning an equal number of rows from the sparse matrix, we can easily end up in the situation where one thread is processing orders of magnitude more work than another, with the resultant loss in parallel speedup.
When we are partitioning a loop or a data structure ourselves, we can also use what we know about the problem and develop an appropriate partitioning strategy. For example, we can partition a sparse matrix so that there are approximately the same number of nonzeros per partition.
The CSRMatrix class has a parallel matrix-vector product function par_matvec()
that
launches tasks to work on parts of the problem (rows of the matrix) as given in the
partitions_
array. The basic approach for partitioning the problem by rows is
contained in the partition_by_row()
function. This function divides the number of
rows in the problem by the number of partitions and assigns that size of problem to
each task. There is a bit of fix up with the remainder in case the problem size
doesn’t divide by the number of partitions evenly. The partial_sum()
call then
creates a running sum in the array, turning the sizes into the beginning row and
ending row of each partition.
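To make the description above concrete, a row-based partitioning like the one just described might be sketched as follows (illustrative only; the actual partition_by_row() in your repo may differ in detail):
#include <cstddef>
#include <numeric>   // std::partial_sum
#include <vector>

// Sketch of the idea: store per-partition row counts, then turn them into
// cumulative begin/end row offsets with partial_sum.
std::vector<std::size_t> partition_by_row_sketch(std::size_t num_rows, std::size_t parts) {
  std::vector<std::size_t> bounds(parts + 1, 0);
  std::size_t size = num_rows / parts, remainder = num_rows % parts;
  for (std::size_t p = 0; p < parts; ++p) {
    bounds[p + 1] = size + (p < remainder ? 1 : 0);   // rows assigned to partition p
  }
  std::partial_sum(bounds.begin(), bounds.end(), bounds.begin());
  return bounds;   // partition p covers rows [bounds[p], bounds[p+1])
}
The partial_sum() call is what turns the per-partition sizes into the running begin/end offsets described above.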
For this problem, fill in the function partition_by_nnz(). This should fill in the partition_ array so that there are approximately the same number of nonzero elements to be processed in each partition.
To attack this problem, think about the information stored in the arrays
you already have in the CSR matrix format. Is the information about the number of
nonzeros in each row already contained somewhere?
Each of your partitions should have a
begin and end such that the number of nonzeros (the number of elements) is
approximately the same for each partition.
The driver program for this problem is csr_scale.cpp. It takes as input a sparse matrix file.
You can try it with the “laplacian” files, the “graph500” files, or the “cit-Patents” file.
The parallel matrix-vector product depends only on the partitioning, so all you have to do to run it is call the appropriate partition member function. You don’t need to change par_matvec at all.
There is a utility program in your directory that you can build and run on any of those files to display the degree distribution of the matrices they contain.
When your program is working correctly, you should see a difference in performance between the partition_by_row and partition_by_nnz partitionings.
Fortunately, OpenMP also provides some mechanisms for load balancing, namely the schedule option to parallel for. We can use that option as part of a pragma:
#pragma omp parallel for schedule(static)
Alternatively, we can make a library call
omp_set_schedule(omp_sched_static, 0);
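For example (an illustrative sketch, not taken from csr_omp.cpp), the schedule(runtime) clause is what makes a loop honor whatever omp_set_schedule() – or the OMP_SCHEDULE environment variable – has established:
#include <cstddef>
#include <vector>
#include <omp.h>

// Sketch: choose the schedule at run time, then run a loop that defers to it.
double scheduled_sum(const std::vector<double>& x, int chunk) {
  omp_set_schedule(omp_sched_dynamic, chunk);   // e.g. dynamic scheduling with the given chunk size
  double sum = 0.0;
#pragma omp parallel for schedule(runtime) reduction(+:sum)
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum += x[i];
  }
  return sum;
}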
The driver program for this problem is csr_omp.cpp, which uses the library calls to set the type of scheduling.
Since you have already
parallelized CSRMatrix
with OpenMP, you just need to compile and run the driver.
The program takes as an option a matrix file as well as a block size to give as a hint
to omp_set_schedule()
. Experiment with different block sizes, particularly with the
cit-Patents matrix.
Which scheduling approach (and which block size) gives the best load-balancing performance?
Finally, we come back to the PageRank algorithm. From the previous problem you have two functions to partition the CSRMatrix and perform a task-based parallel matrix-vector product. You also have an OpenMP-parallelized version of the matrix-vector product.
Parallelize pagerank to the greatest extent that you can (to get the best scalability).
The key set of operations is in the statement
Vector y = alpha * (P * x) + (1.0 - alpha) / x.num_rows();
What do you need to change to effect parallelization of the matrix-vector product portion of the pagerank computation? What parallelization approach did you choose?
(Extra Credit) There are other operations of the expression above that can be parallelized. Use a parallelization approach of your choosing and report on the benefits obtained.
(Extra Credit) Are there any operations in the program where Amdahl’s Law would limit scalability?
Do a make clean in your ps6 directory and then create a tar ball ps6-submit.tar.gz containing all of the files in that directory. From the directory containing ps6:
$ cd ps6
$ make clean
$ cd ..
$ tar zcvf ps6-submit.tar.gz --exclude "./ps6/.vscode" ./ps6
If you relied on any external references, list them in a file refs.txt and include that in your tarball as well (in the ps6 subdirectory).