Looping CIs with infer and purrr

2020-08-07 11 min read Stats

In this post, we’ll build on the previous post and add a bit of iteration, to see how we can reduce the amount of code for the sake of readability and efficiency. Less code also means fewer places where things can go wrong: if an iterated step works in one place, it should work in all the others, whereas copied code has to be changed in many places.

Imagine you have taken a poll, where there are many parties to choose between. Simulated, it would look something like this:

library(tidyverse) # tibble, dplyr, forcats and purrr
library(infer)     # specify(), generate(), calculate(), get_ci()

# One entry per respondent, 1000 in total
voter <- rep("S", 282)
voter <- append(voter, rep("V", 104))
voter <- append(voter, rep("MP", 34))
voter <- append(voter, rep("C", 80))
voter <- append(voter, rep("L", 37))
voter <- append(voter, rep("KD", 59))
voter <- append(voter, rep("M", 211))
voter <- append(voter, rep("SD", 179))
voter <- append(voter, rep("Other/no reply", 14))
voter <- tibble(party = voter)

Giving us a result like this:

Example poll

Party	# of votes	Share
C	80	8.00%
KD	59	5.90%
L	37	3.70%
M	211	21.10%
MP	34	3.40%
Other/no reply	14	1.40%
S	282	28.20%
SD	179	17.90%
V	104	10.40%
Data from here
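The post does not show how the summary table was produced. A minimal sketch, assuming dplyr is available, is below. The names counts, poll_summary and votes are made up for this illustration, and the poll is rebuilt from the counts above so the snippet runs on its own:

```r
library(dplyr)
library(tibble)

# Rebuild the simulated poll from the counts listed above
counts <- c(S = 282, V = 104, MP = 34, C = 80, L = 37, KD = 59,
            M = 211, SD = 179, `Other/no reply` = 14)
voter <- tibble(party = rep(names(counts), times = counts))

# Count the votes per party and turn the counts into shares
poll_summary <- voter %>%
  count(party, name = "votes") %>%
  mutate(share = votes / sum(votes))

poll_summary
```

The share column holds proportions (0.08 for C), which correspond to the percentages in the table.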

This is based on a real poll, but sadly it is not the true data: the real poll had 1010 respondents, of whom 4 % were unsure or chose not to reply. It’ll do for this kind of demonstration, though.

Say we have this poll and want to know the confidence intervals (CIs) of the shares, i.e. how well do we think each share from this poll represents the true proportion in the population?
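As a point of reference, one common textbook answer is the normal-approximation (Wald) interval, p_hat ± z * sqrt(p_hat * (1 - p_hat) / n). This is not the method used in the rest of the post, which bootstraps instead. The sketch below only shows what a by-hand calculation for party C (80 of 1000 respondents) might look like; the variable names are chosen for this illustration:

```r
n     <- 1000     # respondents in the simulated poll
p_hat <- 80 / n   # observed share for party C

z  <- qnorm(0.975)                   # about 1.96 for a 95 % interval
se <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion

wald_ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
round(wald_ci, 3)
```

This comes out at roughly 0.063 to 0.097, so an interval of about 6 % to 10 % is what we should expect for C.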

We could do it by hand (boring) or by using R (better), but writing it all out would be quite repetitive, since it’s really just one word that changes. For instance, if we want to calculate the CI for the party C, we’d do this:

Calculating CIs

p_hat <- voter %>%
    mutate(party = fct_other(party, keep = "C")) %>%
    specify(response = party, success = "C") %>%
    calculate(stat = "prop") %>% 
    pull()

boot <- voter %>%
    mutate(party = fct_other(party, keep = "C")) %>%
    specify(response = party, success = "C") %>%
    generate(reps = 10000, type = "bootstrap") %>%
    calculate(stat = "prop") %>% 
    get_ci() %>% 
    mutate(estimate = p_hat,
           name = "C")
boot

## # A tibble: 1 x 4
##   lower_ci upper_ci estimate name 
##      <dbl>    <dbl>    <dbl> <chr>
## 1    0.064    0.097     0.08 C

We would then repeat all of this for every single party. That is tedious, with a lot of repeated steps. A smarter idea is to loop over each level of party!
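Before moving on, it may help to see roughly what the generate(), calculate() and get_ci() steps amount to. The base R sketch below is an approximation of the idea, not infer’s actual implementation: resample the poll with replacement many times, compute the share of C in each resample, and take the middle 95 % of those shares (get_ci() defaults to a 95 % percentile interval). The seed and the names is_c, boot_props and boot_ci are arbitrary choices for this illustration:

```r
set.seed(2020)  # arbitrary seed, only so the sketch is reproducible

# TRUE for a vote on C, FALSE otherwise (80 of 1000 respondents)
is_c <- rep(c(TRUE, FALSE), times = c(80, 920))

# Resample the poll with replacement and record the share of C each time
boot_props <- replicate(10000, mean(sample(is_c, replace = TRUE)))

# Percentile interval: the 2.5 % and 97.5 % quantiles of the bootstrap shares
boot_ci <- quantile(boot_props, probs = c(0.025, 0.975))
boot_ci
```

Because the resampling is random, the bounds shift slightly from run to run unless a seed is set, so two runs of the same bootstrap can differ in the third decimal.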

Reducing repetitiveness

First, let’s turn calculating CI:s into a function:

ci_calculation <- function(to_keep, voter) {
  # Observed share of the chosen party (the point estimate)
  p_hat <- voter %>%
    mutate(party = fct_other(party, keep = to_keep)) %>%
    specify(response = party, success = to_keep) %>%
    calculate(stat = "prop") %>%
    pull()

  # Bootstrap CI for that share, with the estimate and party name attached
  voter %>%
    mutate(party = fct_other(party, keep = to_keep)) %>%
    specify(response = party, success = to_keep) %>%
    generate(reps = 10000, type = "bootstrap") %>%
    calculate(stat = "prop") %>%
    get_ci() %>%
    mutate(estimate = p_hat,
           name = to_keep)
}

Let’s go through this piece by piece, just so we understand what has happened. First, we name the function ci_calculation, then give it two arguments, to_keep and voter, where to_keep is the party we want the CI for and voter is the data we want to use.

The function then does everything we did before, just with variables that can change instead. Example:

ci_calculation("C", voter)

## # A tibble: 1 x 4
##   lower_ci upper_ci estimate name 
##      <dbl>    <dbl>    <dbl> <chr>
## 1    0.063    0.097     0.08 C

This gives the same result as before (up to bootstrap noise in the third decimal) with one line of code instead of 15. Instead of writing 15 lines for each of the 9 parties, we could write just 9 lines that differ only in the party name.

But we can do even better! Using purrr, we can reduce these 9 lines to just 1. What’s purrr, you may ask? It’s a tidyverse package for applying a function iteratively over the elements of a vector or list, like this:

complete_CI <- map_df(unique(voter$party), ci_calculation, voter)
complete_CI

## # A tibble: 9 x 4
##   lower_ci upper_ci estimate name          
##      <dbl>    <dbl>    <dbl> <chr>         
## 1    0.255    0.31     0.282 S             
## 2    0.085    0.123    0.104 V             
## 3    0.023    0.046    0.034 MP            
## 4    0.063    0.097    0.08  C             
## 5    0.026    0.049    0.037 L             
## 6    0.045    0.074    0.059 KD            
## 7    0.186    0.236    0.211 M             
## 8    0.156    0.202    0.179 SD            
## 9    0.007    0.022    0.014 Other/no reply

Here we use the function map_df, which takes a vector, calls a function once per element with that element as the function’s first argument, and binds the results together into one data frame. We give it unique(voter$party), which finds each unique value in the column party of the data frame voter, like this:

unique(voter$party)

## [1] "S"              "V"              "MP"             "C"             
## [5] "L"              "KD"             "M"              "SD"            
## [9] "Other/no reply"

The second argument is the function we wrote, ci_calculation, and the third argument is the data we want to use, voter. Note that ci_calculation is passed without its parentheses; map_df calls it for us, and the third argument is passed along to it as its second argument on every call.
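To make the argument passing concrete, here is a small sketch with a toy function in place of ci_calculation. The function count_value and the vector letters_seen are hypothetical and exist only for this illustration. It assumes purrr, tibble and dplyr are installed (map_df relies on dplyr to bind the rows):

```r
library(purrr)
library(tibble)

# Hypothetical toy function: count how often `value` occurs in `data`
count_value <- function(value, data) {
  tibble(name = value, n = sum(data == value))
}

letters_seen <- c("a", "b", "a", "c", "a")

# Anything after the function name is passed along on every call, so this runs
# count_value("a", letters_seen), count_value("b", letters_seen), and so on
by_position <- map_df(unique(letters_seen), count_value, letters_seen)

# The same thing with an explicit anonymous function
by_lambda <- map_df(unique(letters_seen), function(v) count_value(v, letters_seen))

by_position
```

Both calls return one row per unique value. In recent purrr versions map_df() is marked as superseded in favour of map() followed by list_rbind(), but it still works as used here.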

The resulting table still looks a bit hard to read, so let’s prettify it a bit using the package gt!

Estimated results

Party	Estimated Result	Lower CI	Upper CI
S	28.20%	25.50%	31.00%
V	10.40%	8.50%	12.30%
MP	3.40%	2.30%	4.60%
C	8.00%	6.30%	9.70%
L	3.70%	2.60%	4.90%
KD	5.90%	4.50%	7.40%
M	21.10%	18.60%	23.60%
SD	17.90%	15.60%	20.20%
Other/no reply	1.40%	0.70%	2.20%

Much better!
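The gt code is not shown in the post, so the following is only a guess at how a table like this could be built. To keep the sketch self-contained, the values of complete_CI are typed in from the output printed earlier, and the name ci_table is made up for this illustration:

```r
library(dplyr)
library(tibble)
library(gt)

# Values copied from the complete_CI output printed earlier in the post
complete_CI <- tribble(
  ~lower_ci, ~upper_ci, ~estimate, ~name,
  0.255,     0.31,      0.282,     "S",
  0.085,     0.123,     0.104,     "V",
  0.023,     0.046,     0.034,     "MP",
  0.063,     0.097,     0.08,      "C",
  0.026,     0.049,     0.037,     "L",
  0.045,     0.074,     0.059,     "KD",
  0.186,     0.236,     0.211,     "M",
  0.156,     0.202,     0.179,     "SD",
  0.007,     0.022,     0.014,     "Other/no reply"
)

ci_table <- complete_CI %>%
  select(name, estimate, lower_ci, upper_ci) %>%  # column order of the table
  gt() %>%
  tab_header(title = "Estimated results") %>%
  fmt_percent(columns = c(estimate, lower_ci, upper_ci), decimals = 2) %>%
  cols_label(
    name     = "Party",
    estimate = "Estimated Result",
    lower_ci = "Lower CI",
    upper_ci = "Upper CI"
  )

ci_table
```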

Leo Carlsson
