Parallel processing with map()
In this post we build on the previous post once more, this time to get a better understanding of parallel processing. Remember the poll from last time, where we calculated confidence intervals (CIs) for each party's share, and the results looked like this?
Example poll

| Party | # of votes | Share |
|---|---|---|
| C | 80 | 8.00% |
| KD | 59 | 5.90% |
| L | 37 | 3.70% |
| M | 211 | 21.10% |
| MP | 34 | 3.40% |
| Other/no reply | 14 | 1.40% |
| S | 282 | 28.20% |
| SD | 179 | 17.90% |
| V | 104 | 10.40% |

Data from here
Good. In it, we calculated the CIs using purrr and a function we’d written that used infer to bootstrap the intervals. That worked well, and it cut down quite a bit on the amount of code we had to write.
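In case you don't have the previous post at hand, here is a minimal stand-in so the rest of this post can be followed along. It is a sketch in base R, not the infer-based function from that post: the `voter` data frame is rebuilt from the counts in the table above, and this `ci_calculation()` is a hypothetical percentile-bootstrap version whose argument names (`reps`, `level`) and output columns are my own assumptions.

```r
# Rebuild one row per respondent from the counts in the table above.
poll <- c(C = 80, KD = 59, L = 37, M = 211, MP = 34,
          "Other/no reply" = 14, S = 282, SD = 179, V = 104)
voter <- data.frame(party = rep(names(poll), times = poll),
                    stringsAsFactors = FALSE)

# Hypothetical stand-in for ci_calculation(): a percentile bootstrap
# of one party's share, returned as a one-row data frame.
ci_calculation <- function(p, data, reps = 1000, level = 0.95) {
  is_party <- data$party == p
  # Resample respondents with replacement and record the share each time.
  boot_shares <- replicate(reps, mean(sample(is_party, replace = TRUE)))
  alpha <- (1 - level) / 2
  ci <- quantile(boot_shares, probs = c(alpha, 1 - alpha), names = FALSE)
  data.frame(party = p, share = mean(is_party),
             lower_ci = ci[1], upper_ci = ci[2])
}

set.seed(123)
ci_calculation("S", voter)
```

Because the function takes the party as its first argument and the data as its second, it can be called once per party with the data passed along as an extra argument, which is the calling pattern used in the rest of this post.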
If you try it yourself, however, you’ll realise that the calculation takes a while. Not an absurd amount of time, especially compared to calculations that really take time, but enough to work as a good intro to parallel processing. Let’s look at how long the process takes using the package tictoc, which measures the elapsed (wall-clock) time between tic() and toc().
```r
# voter and ci_calculation() come from the previous post
library(purrr)
library(tictoc)

set.seed(123)
tic()
complete_CI <- map_dfr(unique(voter$party), ci_calculation, voter)
toc()
## 10.95 sec elapsed
```
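A quick aside on what "elapsed" means, since it matters once we start running things in parallel. This is a small sketch using base R's `system.time()` rather than tictoc, just to show that wall-clock time and CPU time are two different things:

```r
# Sys.sleep() waits without doing any work, so wall-clock time passes
# while almost no CPU time is used.
timing <- system.time(Sys.sleep(0.5))

timing[["elapsed"]]    # wall-clock seconds, roughly 0.5
timing[["user.self"]]  # CPU seconds spent in R code, close to 0
```

Elapsed time is the number we care about here, because it is how long we actually sit and wait.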
Not too slow, but imagine if you had a larger dataset with 10 000 votes instead, or you wanted to run far more simulations. Then the time adds up. Instead, we can use the future_map_*() variants of the map_*() functions from the furrr package, which take each calculation and run it in parallel instead of in sequence. Setting up the workers with plan() takes some extra time, so on a job this small the gain is modest, but in a larger setting you’ll save time.
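```r
library(furrr)

set.seed(123)
# multiprocess is deprecated in recent versions of future;
# multisession is the replacement that works on all platforms
plan(multisession)
tic()
complete_CI_paralell <- future_map_dfr(unique(voter$party), ci_calculation, voter)
toc()
## 4.693 sec elapsed
```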
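One thing to be aware of with bootstrapping in parallel: the random numbers are drawn on the workers, so a `set.seed()` call in the main session may not be enough on its own to make the results repeatable. A minimal sketch of one way to handle it, assuming a furrr version that has `furrr_options()` (0.2.0 or later) and using a toy calculation instead of the poll data:

```r
library(future)
library(furrr)

plan(multisession, workers = 2)

# A fixed integer seed gives each element its own reproducible random stream.
opts <- furrr_options(seed = 123L)
run_a <- future_map_dbl(1:4, function(i) mean(rnorm(100)), .options = opts)
run_b <- future_map_dbl(1:4, function(i) mean(rnorm(100)), .options = opts)

identical(run_a, run_b)

plan(sequential)  # shut the workers down again
```

The same `.options` argument should work with `future_map_dfr()`, so it could be added to the call above if you need the bootstrap intervals to come out the same on every run.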
There we go! Less than half the time! Just for fun, let’s look at how much time we actually saved once the setup is counted too, by moving the plan() call in between tic() and toc().
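```r
set.seed(123)
tic()
# multiprocess is deprecated in recent versions of future;
# multisession is the replacement that works on all platforms
plan(multisession)
complete_CI_paralell <- future_map_dfr(unique(voter$party), ci_calculation, voter)
toc()
## 8.596 sec elapsed
```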
So with the setup included we save only a bit more than two seconds (8.6 versus 10.95), but still, better! And more importantly, we’ve learned something.
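If you want to see where the time goes, you can time the setup and the mapping separately. Below is a small sketch with a made-up `slow_task()` that just sleeps for a second, standing in for an expensive calculation; the choice of two workers is also just an assumption for the example. Depending on your version of future, some of the start-up cost may show up in the first parallel call rather than in `plan()` itself.

```r
library(future)
library(furrr)
library(tictoc)

# Hypothetical stand-in for an expensive calculation.
slow_task <- function(x) {
  Sys.sleep(1)
  x^2
}

# Baseline: the same furrr call, but run one element at a time.
plan(sequential)
tic("sequential map")
seq_result <- future_map_dbl(1:4, slow_task)
toc()

# The one-off cost of starting the background R sessions.
tic("starting the workers")
plan(multisession, workers = 2)
toc()

# The mapping itself, now spread over two workers.
tic("parallel map")
par_result <- future_map_dbl(1:4, slow_task)
toc()

identical(seq_result, par_result)

plan(sequential)  # shut the workers down again
```

The setup is paid once per session, so the more work you push through the workers afterwards, the less it matters.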