Confidence Intervals, simulated vs theoretical

2020-08-04 3 min read Stats

Assume we have two candicates, T and D, who are the two candidates running for election in a small village. You poll 1099 inhabitants, and of them 544 say the will vote for D, and 555 say they will vote for T.

In a table, it’ll look like this:

table(voter)

D	T
544	555

We’re fairly sure about the proportions of these votes, as the following:

D_prop <- 544/1099
D_prop

## [1] 0.4949955

T_prop <- 555/1099
T_prop

## [1] 0.5050045

Ie, around 49.5 % for D, and 50.5 % for T.

How sure are we about these results?

Confidence Intervals (CI:s) to the rescue!

First, lets look at simulated CI:s

D_1000 <- voter %>%
  specify(response = value, success = "D") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95) %>%
    mutate(party = "D",simulated = "Yes", reps = 1000)

T_1000 <- voter %>%
  specify(response = value, success = "T") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95) %>%
    mutate(party = "T", simulated = "Yes", reps = 1000)


ci_voters <- bind_rows(T_1000, D_1000)
ci_voters

lower_ci	upper_ci	party	simulated	reps
0.4749773	0.5341219	T	Yes	1000
0.4649682	0.5232257	D	Yes	1000

Ok, so if we simulate this result 1000 times using a bootstraping method, and a CI for 95 %, we’re 95 % sure that the true population mean is between 47.6 % to 53.4 % for T, with a mean of 50.5 %, and between 46.2 % and 52.2 % for D, with a mean of 49.5 %.

How does this compare with theoretical results? Lets use the following formula for CI:s:

\[\hat{p} \pm z \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

meaning that for T, \(\hat{p}\) = ca 0.505, z = ca 1.96 for a 95 % CI, and n = 1099.

p_hat <- voter %>%
  specify(response = value, success = "T") %>%
  calculate(stat = "prop") %>% pull(stat)

n <- 1099
z<- qnorm(0.975) #since BOTH tails are counted, and 2.5 % are at both end, this is the Z-score.



z*sqrt((p_hat*(1-p_hat))/n)

## [1] 0.02955953

T_theory <- tibble(lower_ci = p_hat-z*sqrt((p_hat*(1-p_hat))/n),
      upper_ci = p_hat+z*sqrt((p_hat*(1-p_hat))/n)) %>% 
    mutate(party = "T",
           simulated = "No",
           reps = 0)

kable(bind_rows(T_1000, T_theory),digits = 3)

lower_ci	upper_ci	party	simulated	reps
0.475	0.534	T	Yes	1000
0.475	0.535	T	No	0



```r
p_hat <- voter %>%
  specify(response = value, success = "D") %>%
  calculate(stat = "prop") %>% pull(stat)

n <- 1099
z<- 1.960


z*sqrt((p_hat*(1-p_hat))/n)

## [1] 0.02956007

D_theory <- tibble(lower_ci = p_hat-z*sqrt((p_hat*(1-p_hat))/n),
      upper_ci = p_hat+z*sqrt((p_hat*(1-p_hat))/n)) %>% 
    mutate(party = "D",
           simulated = "No",
           reps = 0)

kable(bind_rows(D_1000, D_theory), digits = 3)

lower_ci	upper_ci	party	simulated	reps
0.465	0.523	D	Yes	1000
0.465	0.525	D	No	0

As we can see, the CI:s are close for the simulated ones compared to the theoretical ones, but the code for the simulated ones are easier to follow and write.

Infer

Confidence Intervals, simulated vs theoretical

How sure are we about these results?

Leo Carlsson

Related