University of Washington  
Department of Applied Mathematics  
AMATH 483/583   Assigned: May 16, 2019
Problem Set #5   Due: May 23, 2019

Introduction

This assignment is extracted, as with previous assignments, into a directory ps5. In that directory is a file Questions.md with short-answer questions that you will be asked in different parts of this assignment. To fill in your answers, open the file in vscode and insert your answers directly into the file, using the format indicated.

The tar ball for this assignment can be downloaded with this link.

Preliminaries

Bandwidth and Hierarchical Memory Parameters

One important basic characterization that you can make of your machine is to measure the bandwidth for various sizes of data and with various types of data movement instructions. Since many computations are bandwidth bound, this can give you information with which you can estimate performance bounds.

Your src directory for this problem set includes a subdirectory bandwidth, which includes open source code from https://zsmith.co/bandwidth.php, licensed under GPL and provided pro bono.

If you cd into that subdirectory, type

$ make

and you will be presented with a set of build targets. Choose one based on your environment, e.g., for Mac OS

$ make bandwidth-mac64

If you haven’t previously installed the nasm assembler, you will need to install it using brew or apt depending on your environment. Note that WSL should be considered x86_64 rather than Win32.

Once you have compiled the program, execute it. I recommend using one of the options --fast or --faster. Even with these, the program will run for several minutes. During that time your computer should have as little load on it as possible (other than the bandwidth program itself).

When the program completes, there will be two files in the BW directory – bandwidth.png and bandwidth.csv. The file bandwidth.png is an image of a graph showing data transfer rates for various types of operations at various data sizes. The file bandwidth.csv is the raw data that is plotted in bandwidth.png, as comma-separated values. You can use that data to do your own investigations of your computer.

In looking at the bandwidth graph you will see one or more places where the transfer rate drops suddenly. For small amounts of data, the data can fit in cache closer to the CPU – and hence data transfer rates will be higher. The data transfer rates will fall off fairly quickly once that data no longer completely fits into cache. Based on the bandwidth plot obtained for your computer, how many levels of cache does it have? How large are each of the caches (at what sizes does the bandwidth fall)? If possible, check your estimates against published information for the microprocessor that your system is using or with programs on your computer such as lscpu.

You may also notice that write operations tend to be slower than the corresponding read operations.

Warm Up

Futures

C++ provides the mechanism to enable execution of a function call as an asynchronous task. Its semantics are important. Consider the example we have been using in lecture.

#include <future>
#include <iostream>
#include <mutex>
#include <string>

size_t bank_balance = 300;

static std::mutex atm_mutex;
static std::mutex msg_mutex;

size_t withdraw(const std::string& msg, size_t amount) {
  std::lock(atm_mutex, msg_mutex);
  std::lock_guard<std::mutex> message_lock(msg_mutex, std::adopt_lock);

  std::cout << msg << " withdraws " << std::to_string(amount) << std::endl;

  std::lock_guard<std::mutex> account_lock(atm_mutex, std::adopt_lock);

  bank_balance -= amount;
  return bank_balance;
}

int main() {
  std::cout << "Starting balance is " << bank_balance << std::endl;

  std::future<size_t> bonnie = std::async(withdraw, "Bonnie", 100);
  std::future<size_t> clyde  = std::async(deposit, "Clyde", 100);

  std::cout << "Balance after Clyde's deposit is " << clyde.get() << std::endl;
  std::cout << "Balance after Bonnie's withdrawal is " << bonnie.get() << std::endl;

  std::cout << "Final bank balance is " << bank_balance << std::endl;

  return 0;
}

Here we just show the withdraw function; refer to bonnie_and_clyde.cpp for the complete implementation.

Note that we have modified the code from lecture so that the deposit and withdrawal functions return a value, rather than just modifying the global state.

Compile and run the bonnie_and_clyde program several times. Does it always return the same answer?

On my laptop I ran the example and obtained the following results (which were different every time):

[0]$ ./bonnie_and_clyde
Starting balance is 300
Bonnie withdraws 100
Clyde deposits 100
Balance after Clyde's deposit is 300
Balance after Bonnie's withdrawal is 200
Final bank balance is 300

[1]$ ./bonnie_and_clyde
Starting balance is 300
Balance after Clyde's deposit is Bonnie withdraws 100
Clyde deposits 100
300
Balance after Bonnie's withdrawal is 200
Final bank balance is 300

[2]$ ./bonnie_and_clyde
Starting balance is 300
Clyde deposits Balance after Clyde's deposit is 100
Bonnie withdraws 100
400
Balance after Bonnie's withdrawal is 300
Final bank balance is 300

[3]$ ./bonnie_and_clyde
Starting balance is 300
Balance after Clyde's deposit is Clyde deposits 100
400
Balance after Bonnie's withdrawal is Bonnie withdraws 100
300
Final bank balance is 300

See if you can trace out how the different tasks are running and being interleaved. Notice that everything is protected from races – we get the correct answer at the end every time.

Note the ordering of calls in the program:

  std::future<size_t> bonnie = std::async(withdraw, "Bonnie", 100);
  std::future<size_t> clyde  = std::async(deposit, "Clyde", 100);

  std::cout << "Balance after Clyde's deposit is " << clyde.get() << std::endl;
  std::cout << "Balance after Bonnie's withdrawal is " << bonnie.get() << std::endl;

That is, the ordering of the calls is Bonnie makes a withdrawal, Clyde makes a deposit, we get the value back from Clyde’s deposit and then get the value back from Bonnie’s withdrawal. In run 0 above, that is how things are printed out: Bonnie makes a withdrawal, Clyde makes a deposit, we get the value after Clyde’s deposit, which is 300, and we get the value after Bonnie’s withdrawal, which is 200. Even though we are querying the future values with Clyde first and Bonnie second, the values that are returned depend on the order of the original task execution, not the order of the queries.

But let’s look at run 2. In this program, there are three threads: the main thread (which launches the tasks), the thread for the bonnie task and the thread for the clyde task. The first thing that is printed out after the tasks are launched is from Clyde’s deposit. But, before that statement can finish printing in its entirety, the main thread interrupts and starts printing the balance after Clyde’s deposit. The depositing function isn’t finished yet, but the get() has already been called on its future. That thread cannot continue until the clyde task completes and returns its value. The clyde task finishes its cout statement with “100”. But then the bonnie task starts running - and it runs to completion, printing “Bonnie withdraws 100” and it is done. But, the main thread was still waiting for its value to come back from the clyde task – and it does – so the main thread completes the line “Balance after Clyde’s deposit is” and prints “400”. Then the next line prints (“Balance after Bonnie’s withdrawal”) and the program completes.

Try to work out the sequence of interleavings for run 1 and run 3. There is no race condition here – there isn’t ever the case that the two tasks are running in their critical sections at the same time. However, they can be interrupted and another thread can run its own code – it is only the critical section code itself that can only have one thread executing it at a time. Also note that the values returned by bonnie and clyde can vary, depending on which order they are executed, and those values are returned by the future, regardless of which order the futures are evaluated.

std::future::get

Modify your bonnie_and_clyde program in the following way. Add a line in main after the printout from Bonnie’s withdrawal to double check the value:

  std::cout << "Balance after Bonnie's withdrawal is " << bonnie.get() << std::endl;
  std::cout << "Double checking Clyde's deposit " << clyde.get() << std::endl;

(The second line above is the one you add.) What happens when you compile and run this example? Why do you think the designers of C++ and its standard library made futures behave this way?

Function Objects

One of the most powerful features of C++ is function overloading. As we have seen, one can essentially create new languages embedded within C++ by appropriately overloading functions and operators. Overloading operator() is particularly interesting. Although we have overloaded it for matrix and vector types to be used for indexing, operator() can be defined to take any kind of arguments. In essence, an overloaded operator() allows you to pass arbitrary parameters to a function that you define – letting you build your own functions.

Now, of course, when you write a function down when you are programming, you are building functions. But when you build an object that has an overloaded operator(), your program can generate new functions at run time. This is similar to the concept of lambda described below (and in fact was leveraged to develop the original Boost.Lambda library).

Objects that have an operator() defined are called function objects. What makes function objects so powerful is that they can be used just like a function in your programs. The syntax in C++ for a function call is a name followed by a parameter list enclosed in parentheses. This same syntax can mean a plain function call – or a call to an object’s operator() member function.

We have seen two cases already where we pass functions as parameters: std::thread and std::async. But what is important in those cases is not that an actual function is being passed to std::thread and std::async, but rather that the thing being passed is callable (i.e., that it can be passed parameters and a function run on those parameters).

Let’s look at an example. In the running Bonnie and Clyde bank deposit and withdrawal program, we have had two functions withdraw and deposit that are invoked asynchronously by the main program. Instead of defining withdraw and deposit as functions, let’s instead define them as function objects.

Let’s start with a transaction class.

int bank_balance = 300;

class transaction {
public:
  transaction(int _sign, int& _balance) :
    sign(_sign), balance(_balance) {}

  void operator()(int amount) {
    std::lock_guard<std::mutex> tr_lock(tr_mutex);
    balance += sign*amount;
  }

  // Overload operator() to be "noisy"
  void operator()(const std::string& msg, int amount) {
    std::lock_guard<std::mutex> tr_lock(tr_mutex);
    std::cout << msg << ": " << sign*amount;
    std::cout << " -- Balance: " << balance << std::endl;
    balance += sign*amount;
  }

private:
  int sign;  
  int &balance;
  static std::mutex tr_mutex;     // Note static!
};
std::mutex transaction::tr_mutex; // construct static member

The constructor for this class takes a sign (which should be plus or minus one) and keeps a reference to a balance value. The class also has its own mutex for safely modifying the balance.

Previously, the function was defined and invoked like this

void withdraw(int amount) {
  bank_balance -= amount;
}

int main() {

  withdraw(100);

  return 0;
}

With the transaction class we can create a function object that can be used exactly the same way:

int main() {
  transaction withdraw(-1, bank_balance);

  withdraw(100);

  return 0;
}

Again, note that we invoke withdraw the same way in both cases – yet in the first case withdraw is the function withdraw and in the second case it is an object of class transaction.

The full Bonnie and Clyde example (with threads, say) would then be:

int bank_balance = 300;

int main() {
  transaction deposit ( 1, bank_balance);  // define an object named deposit
  transaction withdraw(-1, bank_balance);  // define an object named withdraw

  std::cout << "Starting balance is " << bank_balance << std::endl;

  std::thread bonnie(withdraw, "Bonnie", 100);
  std::thread clyde(deposit, "Clyde", 100);

  bonnie.join();
  clyde.join();

  std::cout << "Final bank balance is " << bank_balance << std::endl;

  return 0;
}

(For brevity we omit the messages that were printed by the original functions.) This looks very much like the original program where withdraw and deposit were functions. But remember, in this case, they are objects, and they are created at run time when we construct them. In summary, function objects let you dynamically create and use stateful callable objects, just as if they were statically defined functions.

Read through the code for this example and run it (and/or modify it).

PageRank

In the next problem sets we will be parallelizing and evaluating PageRank (as well as other application programs). And, as we progress with the program we will also be introducing some of the more advanced C++ techniques that we have discussed in lecture, such as standard library algorithms and templates.

The program itself can take several options. It must be passed either a filename of graph (or sparse matrix) data upon which to perform PageRank, or it must be given the dimensions for a matrix generator (-p specifies the size of the domain for a discretized Laplacian operator). The program also accepts options for the number of lines of ranked data to print out (default is none), an output file in which to store its results (default is cout), the name of a file containing vertex labels, and the number of threads to use once we parallelize.

We are processing command line options in a more sophisticated way than before. Rather than requiring that particular options be in a particular order on the command line, we are recognizing flags (denoted with a leading -) and using those to specify options.

To see an example of the operation of the pagerank program, build it and execute it

$ make pagerank.exe
$ ./pagerank.exe data/Erdos02.mtx -l data/Erdos02_nodename.txt -v 10

This will run the program on the “Erdos” data set and print the top ten results. (Just as Kevin Bacon is the center of the IMDB universe, Paul Erdos is the center of the mathematics universe. The Erdos02.mtx dataset contains a graph of authors who have written papers with each other and who are a distance of 2 in the graph from Erdos himself.)

You should read through the file pagerank.cpp, looking in particular at the function pagerank and, of course, main.

Problems

Norm!

In lecture we introduced the formal definition of vector spaces as a motivation for the interface and functionality of a Vector class. But besides the interface syntax, our Vector class needs to also support the semantics. One of the properties of a vector space is associativity of the addition operation, which we are going to verify experimentally in this assignment.

We have previously defined and used the function

double two_norm(const Vector&  x);

prototyped in amath583.hpp and defined in amath583.cpp.

In your source directory is a file norm.cpp that has skeleton code for this problem. It currently creates a Vector v and then invokes two_norm on it four times, checking the results against that obtained by the first invocation with assert(). Make sure not to have NDEBUG defined so that the assertions get invoked.

Modify this program by adding code (it should only take a call to a standard library function in each case) that sorts v in ascending order before calling two_norm for norm2 and then sorts v in descending order before calling two_norm for norm3.

(There may have been an instance of such a call in the STL slides in Lecture 14).

What happens when you run this program? Do the assertions for norm2 and norm3 pass?
If they don’t pass (hint: they generally won’t), comment them out and print out the absolute and relative differences instead of making assertions.

Note that the program takes a command line argument (more about this anon) to set the size of the vector. The default is 1M elements.

if (norm2 != norm0) {
  std::cout << "Absolute difference: " << std::abs(norm2-norm0) << std::endl;
  std::cout << "Relative difference: " << std::abs(norm2-norm0)/norm0 << std::endl;
}

(This is for norm2 - the code for norm3 should be obvious.)

Answer these questions in Questions.md:
- At what problem size do the answers between the computed norms start to differ?
- How do the absolute and relative errors change as a function of problem size?
- Does the Vector class behave strictly like a member of an abstract vector space?
- Do you have any concerns about this kind of behavior?

Norm! (Partition Edition)

Now we are going to add concurrency and (hopefully) parallelism to speed up computation of the norm. In class we created a PartitionedVector class that carried the partitioning information with it and used that to create parallel tasks. For this problem we are instead simply going to pass in the desired number of partitions and create the appropriate number of tasks based on that.

Write a function with the following prototype:

double partitioned_two_norm(const Vector& x, size_t partitions);

The function takes x and the desired number of partitions as arguments, computes the two norm (Euclidean norm) of x, and returns that value. For your implementation, you are to use std::thread to execute worker tasks. The number of worker tasks (the number of partitions to make) is given by the partitions argument.

You may use lambdas, helper functions, or function objects to create the tasks. Whichever approach you use, each task must safely (i.e., with no race conditions) accumulate the sum of squares of the elements in its partition.

You may use synchronization to safely do the accumulation into a single variable or you may accumulate into separate variables that are later summed together. It is important that the workers just compute and accumulate sums of squares and then the sum of all of those sums is computed – and then the square root taken on that final value. The pi example we worked through in lecture should be helpful guidance for this exercise. This function should go into amath583.cpp (and prototype into amath583.hpp).

Note there is no skeleton code provided for this function.

Add a variable norm4 to your program in norm.cpp that is computed by calling your partitioned_two_norm function. Your program should print out the absolute and relative errors for norm4 as you did with norm2 and norm3.

Modify the main program in norm.cpp to take a second command line argument that specifies the number of threads to launch. If the argument is not specified, your program should use a default value of 1. This value for number of threads should be passed to your invocation of partitioned_two_norm.

In the main function as delivered, we have this:

 size_t N = 1024 * 1024;

  if (argc == 2) {
    N = std::stol(argv[1]);
  }

This creates a variable N with initial value 1M. The program then checks to see if there were two arguments passed (recall that the program name itself is always the first argument). If there were two, it converts the second one into a long integer with the call to std::stol and assigns that to N.

To allow for an optional third argument, we need to do two things. One is that we need to check if there are three arguments – the program, the size, and the number of threads. So we can use a test like:

  if (argc == 3) {
    numthreads = std::stol(argv[2]);
  }

But this won’t do the right thing for N. The variable N will only be set in the case of two arguments, whereas now we want it to be set in the case of two or three (in general, two or more). Thus we want to do something like this:

  if (argc >= 2) {
    N = std::stol(argv[1]);
  }
  if (argc >= 3) {
    numthreads = std::stol(argv[2]);
  }

These tests should come after the variables are defined and initialized.

There are now three different values against which you can compare norm4. You should compare the value of norm4 to value computed with the same order you use for norm4. I suggest putting the computation of norm4 after the computation of norm3 and then compare norm4 to norm3. The reason for this is that we want to look at the effects of parallelization independently of the ordering.

Answer these questions in Questions.md:
- At what number of threads do you start to see a difference between norm4 and norm3? Note that you can run as many threads as you like (recall the difference between concurrency and parallelism).
- Is this difference always the same?
- Do you have any concerns about this kind of behavior?

Norm! (Recursion Version)

Another approach (common in the functional programming community – where lambda comes from) for decomposing a problem into smaller parts is recursion. Recursion is an interesting issue in functional programs because the programs are what the name implies: mathematical functions.

Whereas it isn’t too difficult in the imperative world to think of a function calling itself, in the functional world, recursion means that a function is defined in terms of itself. For example, we could define a function $f(x)$ in the following way:

$$f(x) = x \cdot f(x-1)$$

But then what is $f$? It’s defined in terms of itself. And, what do we get when we plug in a value for $x$, say, 3?

To make a self-referential definition well defined we have to define a specific value for the function with some specific argument. This is usually called the “base case”. In this example, we could define $f(0) = 1$. Then the function is defined as:

$$f(x) = \begin{cases} 1 & x = 0 \\ x \cdot f(x-1) & x > 0 \end{cases}$$
One interesting way to think about recursion is to think of it in terms of abstraction. In the above example, we don’t define what $f(x)$ is. Rather we define it as $x$ times $f(x-1)$ – that is, we are abstracting away the computation – we are pretending that we know the value of $f(x-1)$ when we compute the value of $f(x)$.

This interpretation is also useful for decomposing a problem into smaller pieces that we assume we can compute. For instance, suppose we want to take the sum of squares of the elements of a vector (as a precursor for computing the Euclidean norm, say). Well, we can abstract that away by pretending we can compute the sum of squares of the first half of the vector and of the second half of the vector. The sum of squares of the entire vector is the sum of the values from the two halves. For example:

double sum_of_squares(const Vector& x, size_t begin, size_t end) {
  return sum_of_squares(x, begin,               begin+(end-begin)/2)
       + sum_of_squares(x, begin+(end-begin)/2, end) ;
}

To use an over-used phrase – see what we did there? We’ve defined the sum_of_squares function as the result of that same function on two smaller pieces (which we assume we can compute). This general approach, particularly when dividing a problem in half to solve it, is also known as “divide and conquer.”

Now, in this code example, as with the mathematical example, we need some non-self-referential value. In this case there are several options. We could stop when the difference between begin and end is sufficiently small. Or, we can stop after we have descended a specified number of levels. In the latter case:

double sum_of_squares(const Vector& x, size_t begin, size_t end, size_t level) {
  if (level == 0) {
    return /* code that computes the sum of squares of x from begin to end */;
  } else {
    return sum_of_squares(x, begin,               begin+(end-begin)/2, level-1)
         + sum_of_squares(x, begin+(end-begin)/2, end,                 level-1);
  }
}

Now, we can parallelize this divide and conquer approach by launching asynchronous tasks for the recursive calls and getting the values from the futures.

Consider for a second what will happen, though, particularly if we use std::launch::deferred. We won’t actually do any computation until the entire tree of tasks is built. The first call will create two futures but do nothing until get is called. When get is called on them, each of them will create two more futures that don’t do anything until their get is called, and so on until the base case is reached – at which point the tree collapses back upwards.

Write a function with the following prototype:

double recursive_two_norm(const Vector& x, size_t levels);

The function takes x as an argument, computes its two norm (Euclidean norm), and returns that value. For your implementation, you are to use std::async to execute worker tasks. The number of levels to recurse is given by the argument levels.

As with the partitioned version of two_norm above, you may use a helper function, a lambda, or a function object for task creation.

And as before, each task should safely accumulate the sum of squares of the elements in some defined range (the half-open interval defined by begin and end above) of the Vector by asynchronously invoking itself on intervals of half the size (as shown above). Your code should go into your amath583.hpp and amath583.cpp files.

The file pnorm.cpp is a driver that invokes partitioned_two_norm and recursive_two_norm for a sequence of numbers of threads. It will create a plot showing the execution time for these two functions as a function of number of threads. Run the program with a problem size sufficient to give an execution time of 1000-2000 ms for the first trial (one thread).

Answer these questions in Questions.md:
- How much difference did you see between the partitioned and the recursive versions?
- Under what conditions (problem size, number of threads, etc.) were the differences most pronounced?

Matrix Vector Product (AMATH 583 only)

(This problem can be done as extra credit for 483 students.)

Write a function for parallel matrix-vector product with the following prototype

void par_matvec(const Matrix& A, const Vector& x, Vector& y, size_t partitions);

As with previous matrix-vector product implementations, the vector y should contain the product of A and x when the function returns. The function should divide the overall problem into the given number of partitions.

You may use threads or tasks, lambda, helper functions, or function objects in your parallelization approach.

A driver program for this problem has been provided in matvec.cpp, which takes problem size, maximum number of threads, and number of trips to time on the command line. When run, the program creates a scaling plot of speedup vs number of threads.

Extra Credit.

Implement a second parallel matrix-vector product using a technique for tasking other than the one above. Add that function into matvec.cpp and plot its scaling results.

Submitting Your Work

Do a make clean in your ps5 directory and then create a tar ball ps5-submit.tar.gz containing all of the files in that directory. Starting from the directory containing ps5:

$ cd ps5
$ make clean
$ cd ..
$ tar zcvf ps5-submit.tar.gz --exclude "./ps5/.vscode" ./ps5

If you relied on any external references, list them in a file refs.txt and include that in your tarball as well (in the ps5 subdirectory).

Appendix

More on Lambda

Most of us are accustomed to programming in an imperative fashion. Programs are an ordered set of commands that govern what the CPU should process and in what order. Even in a parallel program written in an imperative fashion, each of the concurrent executions is an ordered set of instructions.

In contrast, functional programming consists of expressions that are evaluated to produce a final value. In pure functional languages there is no state per se and no notion of assignment. The treatment of functions and data is unified: the value of an expression can be a function, just as it could be a number or a string or a list. Using Scheme as an example, the following expression evaluates to a function.

(lambda (x) (* x x))

The value of this expression is a function – a lambda expression. A function can be applied – in this case the function takes one parameter and evaluates the product of that parameter with itself (i.e., it squares it). Note that this function has no name – it just is a function. The function itself is distinct from the body of the function: the body is the value obtained by applying the multiplication operator to x and x.

An unnamed function in Scheme can be used just as a named function:

=> ((lambda (x) (* x x)) 7)
49

The expression above applies the anonymous function to the value 7 and returns 49.

C++11 (and the Boost.Lambda library before it) provides support for lambdas – for unnamed functions – in C++. Besides the expressiveness in being able to generate new functions from other functions, lambdas are convenient when one wants to pass a function to another function, such as when we use std::async.

A sequential version of the matrix-vector multiply function looks like this:

void matvec(const Matrix& A, const Vector& x, Vector& y) {
  for (size_t i = 0; i < A.numRows(); ++i) {
    double sum = 0.0;
    for (size_t j = 0; j < A.numCols(); ++j) {
      sum += A(i, j) * x(j);
    }
    y(i) = sum;
  }
}

We can apply a simple parallelization to this nested loop by performing the inner loop as parallel tasks. However, to realize that without lambda, we must create a helper function and put the inner loop computation there.

double matvec_helper(const Matrix& A, const Vector& x, int i, double init) {
  for (size_t j = 0; j < A.numCols(); ++j) {
    init += A(i, j) * x(j);
  }
  return init;
}

void matvec(const Matrix& A, const Vector& x, Vector& y) {
  std::vector<std::future <double> > futs(A.numRows());
  for (int i = 0; i < A.numRows(); ++i) {
    futs[i] = std::async(matvec_helper, std::cref(A), std::cref(x), i, 0.0);  // std::cref avoids copying A and x into each task
  }
  for (int i = 0; i < A.numRows(); ++i) {
    y(i) = futs[i].get();
  }
}

Unfortunately, now the logic of the original single algorithm is spread across two functions.

With a lambda on the other hand, we don’t need to create a separate, named, helper function. We just need to create an anonymous function right at the call site to std::async.

void matvec(const Matrix& A, const Vector& x, Vector& y) {
  std::vector<std::future <double> > futs(A.numRows());
  for (size_t i = 0; i < A.numRows(); ++i) {
    futs[i] = std::async([&A, &x](int row) -> double {
      double sum = 0.0;
      for (size_t j = 0; j < A.numCols(); ++j) {
        sum += A(row, j) * x(j);
      }
      return sum;
    }, i);
  }
  for (size_t i = 0; i < A.numRows(); ++i) {
    y(i) = futs[i].get();
  }
}

In C++, lambdas begin with [], which can optionally list “captured” variables (that is, variables outside of the scope of the lambda that will be accessed – in this case A and x). The value that the function is really parameterized on is the row we are performing the inner product with, so we make row a parameter to the lambda. The function returns a double, which we indicate with -> double. Next comes the body of the function, which is just like the helper function body – and just like the inner loop we had before. Finally, we indicate that the argument to be passed to the function when it is invoked is i.

Since i is in the outer scope of the lambda, we can also capture it (and drop the explicit parameter):

void matvec(const Matrix& A, const Vector& x, Vector& y) {
  std::vector<std::future <double> > futs(A.numRows());
  for (size_t i = 0; i < A.numRows(); ++i) {
    futs[i] = std::async([&A, &x, i]() -> double {
      double sum = 0.0;
      for (size_t j = 0; j < A.numCols(); ++j) {
        sum += A(i, j) * x(j);
      }
      return sum;
    });
  }
  for (size_t i = 0; i < A.numRows(); ++i) {
    y(i) = futs[i].get();
  }
}

There are a variety of online references to learn more about C++ lambdas and I encourage you to learn more about – and to use – this powerful programming feature.

