Parallel processing map()
In this post, we once again build on the previous post, this time to understand parallel processing a bit better. Remember the last post, where we calculated confidence intervals (CIs) from a poll, and where the results looked like this?
Example poll

Party | # of votes | Share |
---|---|---|
C | 80 | 8.00% |
KD | 59 | 5.90% |
L | 37 | 3.70% |
M | 211 | 21.10% |
MP | 34 | 3.40% |
Other/no reply | 14 | 1.40% |
S | 282 | 28.20% |
SD | 179 | 17.90% |
V | 104 | 10.40% |

Data from here
Good. In it, we calculated the CIs using `purrr` and a function we’d written that used `infer` to bootstrap and calculate the CIs. That was good, and we managed to cut down quite a bit on the amount of code we wrote.
If you try it yourself, however, you’ll realise that the calculation takes a while. Not an absurd amount of time, especially compared to calculations that take time for real, but it works as a good intro to parallel processing. Let’s take a look at how long the process takes, using the package `tictoc`, which measures the elapsed (wall-clock) time it takes to go from the `tic()` to the `toc()`.
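```r
library(purrr)   # map_dfr()
library(tictoc)  # tic() / toc()

# `voter` and `ci_calculation()` come from the previous post
set.seed(123)
tic()
complete_CI <- map_dfr(unique(voter$party), ci_calculation, voter)
toc()
## 10.95 sec elapsed
```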
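If you want to convince yourself that `tictoc` really reports wall-clock time rather than CPU time, here is a minimal sketch. It assumes only that `tictoc` is installed; the `Sys.sleep()` call is just a stand-in for work that waits without using the processor, and the variable names are made up for the illustration.

```r
library(tictoc)

# tic()/toc() measure elapsed (wall-clock) time.
tic("half-second nap")
Sys.sleep(0.5)                 # sleeping uses almost no CPU
nap <- toc(quiet = TRUE)       # returns the start/stop stamps as a list
wall_seconds <- unname(nap$toc - nap$tic)

# system.time() reports CPU time ("user", "system") next to "elapsed".
cpu <- system.time(Sys.sleep(0.5))

wall_seconds        # roughly 0.5
cpu[["user.self"]]  # close to 0
cpu[["elapsed"]]    # roughly 0.5
```

This distinction matters for parallel code: several workers can burn more CPU time in total while the elapsed time you actually wait goes down.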
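Eleven seconds is not too slow, but imagine you had a larger dataset with 10 000 votes, or wanted to run far more simulations: then the time adds up. Instead, we can use `future_map_*()` from the `furrr` package, a variation of the `map_*()` functions that runs the calculations in parallel instead of in sequence. On a job this small the benefit is limited, because `plan(multisession)` takes some extra time to set up, but in a larger setting you’ll save time. (This post originally used `plan(multiprocess)`, which the `future` package has since deprecated in favour of `multisession`.)

```r
library(furrr)  # future_map_dfr(); also attaches future, which provides plan()

set.seed(123)
plan(multisession)  # originally plan(multiprocess), now deprecated in future
tic()
complete_CI_paralell <- future_map_dfr(unique(voter$party), ci_calculation, voter)
toc()
## 4.693 sec elapsed
```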
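One thing to keep in mind with bootstrapping in parallel is that `set.seed()` on its own does not control the random numbers drawn on the workers. The sketch below shows one way to handle that, using `furrr_options(seed = TRUE)`. Note that `voter_toy` and `ci_toy()` are hypothetical stand-ins written for this illustration (base R bootstrap, not the `infer`-based `ci_calculation()` from the previous post), and the two workers are an arbitrary choice.

```r
library(furrr)  # also attaches future, which provides plan()

# Toy stand-in for the poll data: one row per respondent (hypothetical).
set.seed(123)
voter_toy <- data.frame(
  party = sample(c("M", "S", "SD"), size = 1000, replace = TRUE,
                 prob = c(0.3, 0.4, 0.3))
)

# Hypothetical helper: percentile bootstrap CI for one party's share.
ci_toy <- function(p, data, reps = 2000) {
  hit <- data$party == p
  shares <- replicate(reps, mean(sample(hit, replace = TRUE)))
  data.frame(party = p,
             lower = unname(quantile(shares, 0.025)),
             upper = unname(quantile(shares, 0.975)))
}

plan(multisession, workers = 2)

run_parallel <- function() {
  set.seed(123)
  # seed = TRUE gives each element its own parallel-safe random stream,
  # derived from the current RNG state set above.
  future_map_dfr(unique(voter_toy$party), ci_toy, voter_toy,
                 .options = furrr_options(seed = TRUE))
}

first  <- run_parallel()
second <- run_parallel()
identical(first, second)  # the two runs should agree

plan(sequential)  # shut the background workers down again
```

Without the `seed` option, `furrr` warns that random numbers were generated without a parallel-safe seed, and repeated runs are not guaranteed to match.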
There we go! At 4.7 seconds, that is less than half the time of the sequential run. Just for fun, let’s look at how much time we actually saved once the setup is included, by putting the `plan(multisession)` call between the `tic()` and the `toc()`.

```r
set.seed(123)
tic()
plan(multisession)  # originally plan(multiprocess), now deprecated in future
complete_CI_paralell <- future_map_dfr(unique(voter$party), ci_calculation, voter)
toc()
## 8.596 sec elapsed
```
So we only saved a little over two seconds here (8.6 versus 10.95), but still, better! And more importantly, we’ve learned something.
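To see the setup cost and the speed-up separately, here is a small sketch that times the three pieces on their own. Everything in it is made up for the illustration: `slow_task()` just sleeps for a second, `elapsed()` is a tiny helper around `tic()`/`toc()`, and two workers is an arbitrary choice. The exact numbers will differ from machine to machine.

```r
library(purrr)
library(furrr)   # also attaches future, which provides plan()
library(tictoc)

# Hypothetical stand-in for an expensive calculation.
slow_task <- function(i) {
  Sys.sleep(1)
  i
}

# Small helper: `expr` is evaluated lazily, so it runs between tic and toc.
elapsed <- function(expr) {
  tic()
  force(expr)
  t <- toc(quiet = TRUE)
  unname(t$toc - t$tic)
}

seq_time   <- elapsed(map(1:4, slow_task))               # one task at a time
setup_time <- elapsed(plan(multisession, workers = 2))   # starting the workers
par_time   <- elapsed(future_map(1:4, slow_task))        # two tasks at a time

plan(sequential)  # shut the background workers down again

c(sequential = seq_time, setup = setup_time, parallel = par_time)
```

The setup is paid once per session, so the more work you push through the same plan, the less it matters.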