Confidence Intervals, simulated vs theoretical

Assume we have two candicates, T and D, who are the two candidates running for election in a small village. You poll 1099 inhabitants, and of them 544 say the will vote for D, and 555 say they will vote for T.

In a table, it’ll look like this:

table(voter)
D T
544 555

We’re fairly sure about the proportions of these votes, as the following:

D_prop <- 544/1099
D_prop
## [1] 0.4949955
T_prop <- 555/1099
T_prop
## [1] 0.5050045

Ie, around 49.5 % for D, and 50.5 % for T.

How sure are we about these results?

Confidence Intervals (CI:s) to the rescue!

First, lets look at simulated CI:s

D_1000 <- voter %>%
  specify(response = value, success = "D") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95) %>%
    mutate(party = "D",simulated = "Yes", reps = 1000)

T_1000 <- voter %>%
  specify(response = value, success = "T") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95) %>%
    mutate(party = "T", simulated = "Yes", reps = 1000)


ci_voters <- bind_rows(T_1000, D_1000)
ci_voters
lower_ci upper_ci party simulated reps
0.4749773 0.5341219 T Yes 1000
0.4649682 0.5232257 D Yes 1000

Ok, so if we simulate this result 1000 times using a bootstraping method, and a CI for 95 %, we’re 95 % sure that the true population mean is between 47.6 % to 53.4 % for T, with a mean of 50.5 %, and between 46.2 % and 52.2 % for D, with a mean of 49.5 %.

How does this compare with theoretical results? Lets use the following formula for CI:s:

\[\hat{p} \pm z \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

meaning that for T, \(\hat{p}\) = ca 0.505, z = ca 1.96 for a 95 % CI, and n = 1099.

p_hat <- voter %>%
  specify(response = value, success = "T") %>%
  calculate(stat = "prop") %>% pull(stat)

n <- 1099
z<- qnorm(0.975) #since BOTH tails are counted, and 2.5 % are at both end, this is the Z-score.



z*sqrt((p_hat*(1-p_hat))/n)
## [1] 0.02955953
T_theory <- tibble(lower_ci = p_hat-z*sqrt((p_hat*(1-p_hat))/n),
      upper_ci = p_hat+z*sqrt((p_hat*(1-p_hat))/n)) %>% 
    mutate(party = "T",
           simulated = "No",
           reps = 0)

kable(bind_rows(T_1000, T_theory),digits = 3)
lower_ci upper_ci party simulated reps
0.475 0.534 T Yes 1000
0.475 0.535 T No 0


```r
p_hat <- voter %>%
  specify(response = value, success = "D") %>%
  calculate(stat = "prop") %>% pull(stat)

n <- 1099
z<- 1.960


z*sqrt((p_hat*(1-p_hat))/n)
## [1] 0.02956007
D_theory <- tibble(lower_ci = p_hat-z*sqrt((p_hat*(1-p_hat))/n),
      upper_ci = p_hat+z*sqrt((p_hat*(1-p_hat))/n)) %>% 
    mutate(party = "D",
           simulated = "No",
           reps = 0)

kable(bind_rows(D_1000, D_theory), digits = 3)
lower_ci upper_ci party simulated reps
0.465 0.523 D Yes 1000
0.465 0.525 D No 0

As we can see, the CI:s are close for the simulated ones compared to the theoretical ones, but the code for the simulated ones are easier to follow and write.

Leo Carlsson
Leo Carlsson

My research interests include science, statistics and politics.

Related