Looping CIs with infer and purrr
In this post, we’ll build on the previous post and add a bit of iteration, looking at how we can reduce the amount of code to make it easier to read and more efficient to write. Less code also means fewer places where problems can creep in: if a function works in one place, it should work everywhere it is iterated, whereas copied code has to be changed in every place it was pasted.
Imagine you have taken a poll, where there are many parties to choose between. Simulated, it would look something like this:
library(tidyverse)
library(infer)

voter <- rep("S", 282)
voter <- append(voter, rep("V", 104))
voter <- append(voter, rep("MP", 34))
voter <- append(voter, rep("C", 80))
voter <- append(voter, rep("L", 37))
voter <- append(voter, rep("KD", 59))
voter <- append(voter, rep("M", 211))
voter <- append(voter, rep("SD", 179))
voter <- append(voter, rep("Other/no reply", 14))
voter <- tibble(party = voter)
This gives us the following result:
Example poll

Party | # of votes | Share |
---|---|---|
C | 80 | 8.00% |
KD | 59 | 5.90% |
L | 37 | 3.70% |
M | 211 | 21.10% |
MP | 34 | 3.40% |
Other/no reply | 14 | 1.40% |
S | 282 | 28.20% |
SD | 179 | 17.90% |
V | 104 | 10.40% |

Data from here
This is based on a real poll, but sadly it is not the true data: the real poll had 1010 respondents, of whom 4 % were unsure or chose not to reply. It will do for this kind of demonstration, though.
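The same simulated data can be built more compactly, and the table above recomputed from it. This is only a sketch: the names counts, voter_alt and poll_table are made up for illustration and do not appear in the original code.

```r
library(dplyr)
library(tibble)

# Hypothetical named vector holding the simulated vote counts from above
counts <- c(S = 282, V = 104, MP = 34, C = 80, L = 37, KD = 59,
            M = 211, SD = 179, "Other/no reply" = 14)

# rep() with times = repeats each party name as many times as it got votes
voter_alt <- tibble(party = rep(names(counts), times = counts))

# Count the votes per party and turn the counts into shares
poll_table <- voter_alt %>%
  count(party, name = "votes") %>%
  mutate(share = votes / sum(votes))

poll_table
```

Keeping the counts in one named vector means a corrected number only has to be changed in one place.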
Say we have this poll and want to know the confidence intervals (CIs) of the shares, i.e. how well do we think each share represents the true population proportion?
We could do it by hand (boring) or by using R (better), but writing it all out would be quite repetitive, since it’s really just one word that changes between parties. For instance, if we want to calculate the CI for the party C, we’d do this:
Calculating CIs
# Observed share of C: lump all other parties into "Other", then take the proportion
p_hat <- voter %>%
  mutate(party = fct_other(party, keep = "C")) %>%
  specify(response = party, success = "C") %>%
  calculate(stat = "prop") %>%
  pull()

# Bootstrap the share 10000 times and take the confidence interval
boot <- voter %>%
  mutate(party = fct_other(party, keep = "C")) %>%
  specify(response = party, success = "C") %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci() %>%
  mutate(estimate = p_hat,
         name = "C")

boot
## # A tibble: 1 x 4
## lower_ci upper_ci estimate name
## <dbl> <dbl> <dbl> <chr>
## 1 0.064 0.097 0.08 C
We would then repeat all of this for every single party. That is tedious, with a lot of repeated steps. A smarter idea is to loop over each party in the poll!
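To get a feel for what the generate(), calculate() and get_ci() steps do, here is a rough base R sketch of a percentile bootstrap for party C. It assumes get_ci() is left at its defaults (taken here to be a 95 % percentile interval). The names is_c, boot_props and ci_c are made up for illustration, and this is not how infer is implemented internally.

```r
set.seed(2019)  # arbitrary seed so the resamples are reproducible

# TRUE for the 80 respondents who chose C, FALSE for the other 920
is_c <- rep(c(TRUE, FALSE), times = c(80, 920))

# Resample the 1000 answers with replacement and record the share of C each time
boot_props <- replicate(10000, mean(sample(is_c, replace = TRUE)))

# The middle 95 % of the bootstrap shares forms the percentile interval
ci_c <- quantile(boot_props, probs = c(0.025, 0.975))
ci_c
```

Without a fixed seed the bounds shift slightly from run to run, since the resamples are random.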
Reducing repetitiveness
First, let’s turn the CI calculation into a function:
ci_calculation <- function(to_keep, voter) {
  # Observed share of the party given in to_keep
  p_hat <- voter %>%
    mutate(party = fct_other(party, keep = to_keep)) %>%
    specify(response = party, success = to_keep) %>%
    calculate(stat = "prop") %>%
    pull()

  # Bootstrapped CI for that share, returned as a one-row tibble
  voter %>%
    mutate(party = fct_other(party, keep = to_keep)) %>%
    specify(response = party, success = to_keep) %>%
    generate(reps = 10000, type = "bootstrap") %>%
    calculate(stat = "prop") %>%
    get_ci() %>%
    mutate(estimate = p_hat,
           name = to_keep)
}
Let’s go through this piece by piece, just so we understand what has happened. First, we name the function ci_calculation, then give it two arguments, to_keep and voter, where to_keep is the party we want the CI for and voter is the data we want to use.
The function then does everything we did before, just with variables that can change instead. Example:
ci_calculation("C", voter)
## # A tibble: 1 x 4
## lower_ci upper_ci estimate name
## <dbl> <dbl> <dbl> <chr>
## 1 0.063 0.097 0.08 C
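This did everything we did before, but with one line of code instead of 15. So instead of writing 15 lines of code for each of the 9 parties, we could write just 9 lines, each differing only in the party name. (The lower bound differs slightly from before, 0.063 versus 0.064, because the bootstrap resamples are random.)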
But we can do even better! Using purrr we can reduce these 9 lines to just 1. What’s purrr, you may ask? It’s a package for applying a function iteratively over the elements of a vector or list, like this:
complete_CI <- map_df(unique(voter$party), ci_calculation, voter)
complete_CI
## # A tibble: 9 x 4
## lower_ci upper_ci estimate name
## <dbl> <dbl> <dbl> <chr>
## 1 0.255 0.31 0.282 S
## 2 0.085 0.123 0.104 V
## 3 0.023 0.046 0.034 MP
## 4 0.063 0.097 0.08 C
## 5 0.026 0.049 0.037 L
## 6 0.045 0.074 0.059 KD
## 7 0.186 0.236 0.211 M
## 8 0.156 0.202 0.179 SD
## 9 0.007 0.022 0.014 Other/no reply
Here we use the function map_df, which takes a vector, calls a function once per element with that element as the function’s first argument, and binds the results together into one data frame. We give it unique(voter$party), which finds each unique value in the column party in the data frame voter, like this:
unique(voter$party)
## [1] "S" "V" "MP" "C"
## [5] "L" "KD" "M" "SD"
## [9] "Other/no reply"
The second argument is the function we wrote, ci_calculation, and the third argument is the data we want to use, voter. Note that ci_calculation is written without its parentheses: we pass the function itself, and map_df feeds the third argument into it as its second argument on each call.
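To see how the extra argument is passed along, here is a small sketch with a toy function. share_of and votes are hypothetical stand-ins for ci_calculation and voter. The second version writes the call out explicitly with an anonymous function and list_rbind(), which assumes purrr 1.0.0 or later (where map_df() is marked as superseded) and R 4.1 or later for the \(p) and |> syntax.

```r
library(purrr)
library(tibble)

# Hypothetical toy function standing in for ci_calculation:
# returns the share of `to_keep` in the character vector `votes`
share_of <- function(to_keep, votes) {
  tibble(name = to_keep, estimate = mean(votes == to_keep))
}

# Hypothetical toy data standing in for voter$party
votes <- c("S", "S", "V", "M")

# Extra arguments after the function are passed on to every call
# (map_df() needs dplyr installed to bind the rows)
implicit <- map_df(unique(votes), share_of, votes)

# The same thing with the call to share_of written out in full
explicit <- map(unique(votes), \(p) share_of(p, votes)) |>
  list_rbind()

implicit
explicit
```

Both versions call share_of once per unique value; the explicit form just makes it visible where each argument ends up.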
The resulting table still looks a bit hard to read, so let’s prettify it a bit using the package gt!
Estimated results

Party | Estimated Result | Lower CI | Upper CI |
---|---|---|---|
S | 28.20% | 25.50% | 31.00% |
V | 10.40% | 8.50% | 12.30% |
MP | 3.40% | 2.30% | 4.60% |
C | 8.00% | 6.30% | 9.70% |
L | 3.70% | 2.60% | 4.90% |
KD | 5.90% | 4.50% | 7.40% |
M | 21.10% | 18.60% | 23.60% |
SD | 17.90% | 15.60% | 20.20% |
Other/no reply | 1.40% | 0.70% | 2.20% |
Much better!
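The gt code for the table is not shown here, but something along these lines would produce a similar one. It is a sketch that uses only the first two rows of complete_CI from the output earlier; the formatting choices (title, column labels, two decimals) are guesses based on the rendered table, and results_table is a made-up name.

```r
library(dplyr)
library(tibble)
library(gt)

# First two rows of complete_CI, copied from the printed output
complete_CI <- tribble(
  ~lower_ci, ~upper_ci, ~estimate, ~name,
  0.255,     0.31,      0.282,     "S",
  0.085,     0.123,     0.104,     "V"
)

results_table <- complete_CI %>%
  # Put the columns in the order they appear in the rendered table
  select(name, estimate, lower_ci, upper_ci) %>%
  gt() %>%
  tab_header(title = "Estimated results") %>%
  cols_label(
    name = "Party",
    estimate = "Estimated Result",
    lower_ci = "Lower CI",
    upper_ci = "Upper CI"
  ) %>%
  # Show the proportions as percentages with two decimals
  fmt_percent(columns = c(estimate, lower_ci, upper_ci), decimals = 2)

results_table
```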