## Exercise 1 (Simulating a Sampling Distribution)
Perform three simulation studies:
1. $n = 5$
2. $n = 25$
3. $n = 50$
For each sample size, generate data from an exponential distribution with a rate parameter of 2. (That is, a mean of 0.5.)
$$
X_1, \ldots, X_n \sim \text{Exp}(\lambda = 2)
$$
We are interested in the sampling distribution of
$$
\hat{\theta} = \bar{X} = \frac{1}{n}\sum_{i = 1}^{n}X_i
$$
For each study, perform at least 1000 simulations, each time storing the value of the estimator. Plot a histogram of the simulated estimator values and overlay the approximate large sample distribution of the estimator, that is
$$
\bar{X} \approx N \left(\mathbb{E}[X], \frac{\mathbb{V}[X]}{n} \right)
$$
Your final answer should be three side-by-side histograms, each with an overlay of a density.
- Note: This exercise is hinting at how large $n$ needs to be to approximate the sampling distribution of the sample mean with a normal distribution as suggested by our discussions of the CLT.
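A minimal sketch of one of the three studies, assuming base R graphics. The helper name `sim_xbar`, the seed, and the plotting choices are illustrative and not part of the assignment. It relies on the fact that an exponential distribution with rate $\lambda$ has mean $1 / \lambda$ and variance $1 / \lambda^2$.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical helper: simulate the sample mean of n Exp(rate) draws, n_sims times
sim_xbar = function(n, n_sims = 1000, rate = 2) {
  replicate(n_sims, mean(rexp(n, rate = rate)))
}

n = 5
xbar = sim_xbar(n)

# Histogram on the density scale so that a density curve can be overlaid
hist(xbar, probability = TRUE, breaks = 30, col = "darkgrey",
     xlab = "Sample mean", main = paste("n =", n))
box()
grid()

# CLT approximation: N(1 / rate, (1 / rate^2) / n), here with rate = 2
curve(dnorm(x, mean = 1 / 2, sd = sqrt((1 / 2^2) / n)),
      add = TRUE, col = "dodgerblue", lwd = 2)
```

Calling `par(mfrow = c(1, 3))` first and looping the plotting code over `c(5, 25, 50)` is one way to get the side-by-side layout.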
## Exercise 2 (How Large is Large?)
Return to your results from Exercise 1. Instead of plotting histograms with a density, plot the empirical CDF and the approximate normal CDF for each of the simulation studies. Based on these results, comment on the values of $n$ for which you would feel comfortable using the normal distribution as an approximation to the true sampling distribution.
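A sketch of the CDF comparison for a single sample size, assuming the same $\text{Exp}(\lambda = 2)$ setup so that the approximating normal has mean $0.5$ and variance $0.25 / n$. The quantity `max_gap` is an optional numerical summary that the exercise does not ask for.

```r
set.seed(1)  # arbitrary seed
n = 5
xbar = replicate(1000, mean(rexp(n, rate = 2)))

# Empirical CDF of the simulated sample means
plot(ecdf(xbar), main = paste("n =", n), xlab = "Sample mean", ylab = "CDF")

# CDF of the CLT approximation N(0.5, 0.25 / n)
curve(pnorm(x, mean = 0.5, sd = sqrt(0.25 / n)),
      add = TRUE, col = "dodgerblue", lwd = 2)

# Rough summary: largest gap between the two CDFs at the simulated values
max_gap = max(abs(ecdf(xbar)(xbar) - pnorm(xbar, mean = 0.5, sd = sqrt(0.25 / n))))
```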
## Exercise 3 (Distribution of a Bootstrap Sample)
Let $X_1, X_2, \ldots, X_n$ be distinct observations, that is, no ties. Let $X_1^\star, X_2^\star, \ldots, X_n^\star$ denote a bootstrap sample and let
$$
\bar{X}_n^\star = \frac{1}{n}\sum_{i = 1}^{n}X_i^\star.
$$
Find:
- $\mathbb{E}\left[ \bar{X}_n^\star \mid X_1, X_2, \ldots, X_n \right]$
- $\mathbb{V}\left[ \bar{X}_n^\star \mid X_1, X_2, \ldots, X_n \right]$
- $\mathbb{E}\left[ \bar{X}_n^\star \right]$
- $\mathbb{V}\left[ \bar{X}_n^\star \right]$
## Exercise 4 (Professor Salaries)
The following code loads data about Professor salaries. (Check the documentation for details.) We will be interested in the `salary` variable.
```{r}
salaries = carData::Salaries
```
Define $\theta = T(F) = q_{0.25}$. Create a 95% confidence interval for the 25th percentile of professor salaries using each of the three bootstrap interval methods: Normal, Pivotal, Percentile.
Use at least 2000 bootstrap samples for each interval.
- Note 1: To obtain a "plug-in" estimate for $q_{p}$ you may simply use the default arguments to R's `quantile()` function.
- Note 2: These salaries are a few years old and are for **Tenure Track** faculty. Not all of your instructors fall into this category.
- Fun Fact: Illinois is a state institution, so salary information is public. We leave it as an exercise to the reader to find this data.
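A sketch of the three interval types for a generic statistic, assuming the usual definitions: Normal uses the bootstrap standard error, Pivotal reflects the bootstrap quantiles around the estimate, and Percentile uses the bootstrap quantiles directly. The helper `boot_intervals` and the synthetic data are hypothetical stand-ins; in the exercise the data would be `salaries$salary`.

```r
# Hypothetical helper: Normal, Pivotal and Percentile bootstrap intervals
# for a statistic `stat` computed on a numeric vector `x`
boot_intervals = function(x, stat, B = 2000, level = 0.95) {
  alpha = 1 - level
  est = stat(x)  # plug-in estimate on the observed data

  # Bootstrap replicates: resample x with replacement, recompute the statistic
  reps = replicate(B, stat(sample(x, replace = TRUE)))
  q = unname(quantile(reps, probs = c(alpha / 2, 1 - alpha / 2)))

  rbind(
    normal     = est + c(-1, 1) * qnorm(1 - alpha / 2) * sd(reps),
    pivotal    = c(2 * est - q[2], 2 * est - q[1]),
    percentile = q
  )
}

set.seed(1)  # arbitrary seed
x = rexp(200, rate = 1 / 100000)  # synthetic stand-in, NOT the salary data
q25 = function(x) unname(quantile(x, probs = 0.25))
ci = boot_intervals(x, stat = q25)
ci
```

All three intervals here share one set of bootstrap replicates. Drawing a fresh set for each method would also satisfy the wording of the exercise.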
## Exercise 5 (How Long Will You Survive Cancer?)
For this exercise we will use the `Melanoma` data from the `MASS` package.
```{r}
head(MASS::Melanoma)
```
We’ll focus on the `time` variable which is survival time in days.
```{r}
mel_survive = MASS::Melanoma$time
```
```{r}
hist(mel_survive, col = "darkgrey",
xlab = "Survival (Days)", main = "Histogram of Melanoma Survival")
box()
grid()
```
Let $X$ be the survival time in **years** and define
$$
\theta = T(F) = P(X > 5).
$$
Create a 95% percentile bootstrap confidence interval for $\theta$, the probability of surviving longer than 5 years. Use at least 20000 bootstrap samples. Also plot a histogram of the bootstrap replicates and overlay the large-sample approximate **estimated** sampling distribution.
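One possible outline, assuming a conversion of 365.25 days per year (the exercise does not fix one) and treating every recorded `time` as an observed survival time. The plug-in estimate of $\theta$ is a sample proportion, so the estimated large-sample distribution used here is $N\left(\hat{\theta}, \hat{\theta}(1 - \hat{\theta}) / n\right)$.

```r
set.seed(1)  # arbitrary seed
years = MASS::Melanoma$time / 365.25  # assumption: 365.25 days per year
n = length(years)

theta_hat = mean(years > 5)  # plug-in estimate of P(X > 5)

# Bootstrap replicates of the plug-in estimate
reps = replicate(20000, mean(sample(years, replace = TRUE) > 5))

# 95% percentile interval
ci = unname(quantile(reps, probs = c(0.025, 0.975)))

hist(reps, probability = TRUE, col = "darkgrey",
     xlab = "Bootstrap replicates", main = "P(X > 5)")
box()
grid()

# Estimated large-sample sampling distribution of a sample proportion
curve(dnorm(x, mean = theta_hat, sd = sqrt(theta_hat * (1 - theta_hat) / n)),
      add = TRUE, col = "dodgerblue", lwd = 2)
```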
## Exercise 6 (Deflategate)
On January 18, 2015, Clete Blakeman measured the pressure in pounds per square inch (PSI) of 15 footballs during halftime of the AFC Championship game. Of these footballs, 11 were a sample from the New England Patriots. The remaining 4 were a sample from the Indianapolis Colts. The data follow:
```{r}
pats = c(11.50, 10.85, 11.15, 10.70, 11.10, 11.60, 11.85, 11.10, 10.95, 10.50, 10.90)
colts = c(12.70, 12.75, 12.50, 12.55)
```
Use the percentile method to create a 95% confidence interval for the difference in medians of the pressure of the Patriots' and Colts' footballs. Use at least 2000 bootstrap samples.
- Note 1: These sample sizes are probably too small.
- Note 2: This is not a rigorous enough analysis to discredit the Patriots. However, disliking the Patriots is totally normal and acceptable! For actual details, see the [Wells Report](https://a.espncdn.com/pdf/2015/0506/PatriotsWellsReport.pdf). (Be aware that the report contains text messages that use some not so pleasant language.)
- Note 3: This is not "tidy" data, but for this example, it is much easier to work with.
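A sketch of a two-sample percentile bootstrap, assuming the two teams are resampled independently and the difference is taken as Patriots minus Colts. Both choices are assumptions, since the exercise does not specify them.

```r
set.seed(1)  # arbitrary seed
pats = c(11.50, 10.85, 11.15, 10.70, 11.10, 11.60, 11.85, 11.10, 10.95, 10.50, 10.90)
colts = c(12.70, 12.75, 12.50, 12.55)

# Each replicate resamples each team separately, then differences the medians
reps = replicate(2000, {
  median(sample(pats, replace = TRUE)) - median(sample(colts, replace = TRUE))
})

# 95% percentile interval for the difference in medians (Patriots minus Colts)
ci = unname(quantile(reps, probs = c(0.025, 0.975)))
ci
```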
## Exercise 7 (Rank Correlation)
The following loads the `airquality` data and then removes any missing data.
```{r}
aq = na.omit(airquality)
```
Use the percentile method to create a 90% confidence interval for the Spearman rank correlation between `Ozone` and `Wind`. Use at least 2000 bootstrap samples.
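A sketch of the paired resampling this calls for. Because the statistic depends on the pairs, it resamples row indices so that each `Ozone` value stays with its own `Wind` value. The seed and object names are illustrative.

```r
set.seed(1)  # arbitrary seed
aq = na.omit(airquality)

# Resample row indices so that each (Ozone, Wind) pair stays together
reps = replicate(2000, {
  idx = sample(nrow(aq), replace = TRUE)
  cor(aq$Ozone[idx], aq$Wind[idx], method = "spearman")
})

# 90% percentile interval
ci = unname(quantile(reps, probs = c(0.05, 0.95)))
ci
```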
## Exercise 8 (Bootstrap Replicates and the Sampling Distribution)
The following code generates data.
```{r}
set.seed(42)
some_data = rnorm(n = 100, mean = 5, sd = 1)
```
Suppose that we were interested in estimating $\theta = e^\mu$ and wanted to consider the estimator
$$
\hat{\theta} = e^{\bar{X}}.
$$
Generate 2000 (or more) bootstrap replicates of this estimator. Plot a histogram of these bootstrap replicates and overlay the **true** sampling distribution of $\hat{\theta}$.
Hint: Note that the distribution of $\bar{X}$ is normal, thus the distribution of $\hat{\theta}$ is a well known distribution that we have seen before. (Consider returning to the Homework 02 solutions.) The `dlnorm` function might be worth looking into.
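A sketch along the lines of the hint. Since $\bar{X} \sim N(5, 1 / 100)$ for this data-generating process, $e^{\bar{X}}$ is log-normal with `meanlog = 5` and `sdlog = 0.1`. The plotting choices are illustrative.

```r
set.seed(42)
some_data = rnorm(n = 100, mean = 5, sd = 1)

# Bootstrap replicates of exp(sample mean)
reps = replicate(2000, exp(mean(sample(some_data, replace = TRUE))))

hist(reps, probability = TRUE, breaks = 30, col = "darkgrey",
     xlab = "Bootstrap replicates", main = "exp(sample mean)")
box()
grid()

# True sampling distribution: Xbar ~ N(5, 1 / 100), so exp(Xbar) is log-normal
curve(dlnorm(x, meanlog = 5, sdlog = 1 / sqrt(100)),
      add = TRUE, col = "dodgerblue", lwd = 2)
```

The histogram is centered near $e^{\bar{x}}$ for this particular sample, while the true density is centered near $e^{5}$, so the two need not line up exactly.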
## Exercise 9 (The Bootstrap is Not Magic)
Based on Exercise 8, you might not yet be convinced that the empirical distribution of bootstrap replicates is a good estimate of the true sampling distribution.
Repeat Exercise 8 three additional times with the data provided below. For each, plot a histogram of these bootstrap replicates and overlay the **true** sampling distribution of $\hat{\theta}$. Additionally, plot the empirical CDF of the bootstrap replicates as well as the CDF of the **true** sampling distribution of $\hat{\theta}$.
Your answer should be a total of six plots:
- Three histograms with a density, preferably side-by-side.
- Three plots, each with two CDFs, preferably side-by-side.
```{r}
set.seed(17)
data_1 = rnorm(n = 100, mean = 5, sd = 1)
data_2 = rnorm(n = 100, mean = 5, sd = 1)
data_3 = rnorm(n = 100, mean = 5, sd = 1)
```
## Exercise 10 (Bootstrap Coverage)
Perform a simulation study to assess the coverage of the three bootstrap confidence interval methods we have discussed: Normal, Pivotal, and Percentile.
Let $n = 50$ and
$$
T(F) = \int \frac{(x - \mu) ^ 3}{\sigma^3} dF(x).
$$
Generate $Y_1, Y_2, \ldots, Y_n \sim N(0, 1)$ and set $X_i = e^{Y_i}$ for $i = 1, 2, \ldots, n$. With this sample $X_1, \ldots, X_n$, construct a 95% confidence interval for $T(F)$ using each of the three methods. Use at least 500 bootstrap samples for each, but more is better.
Repeat this process at least 1000 times. Use the results to assess the coverage of the three interval types.
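A skeleton for the coverage study, assuming the usual definitions of the three intervals and using the known skewness of the standard log-normal distribution, $(e + 2)\sqrt{e - 1} \approx 6.18$, as the true value of $T(F)$. The helper names are hypothetical, and the number of repetitions is deliberately reduced to keep the sketch quick.

```r
# Plug-in estimate of the skewness T(F)
skew = function(x) {
  mean((x - mean(x))^3) / mean((x - mean(x))^2)^(3 / 2)
}

# True skewness of exp(Y) with Y ~ N(0, 1)
theta = (exp(1) + 2) * sqrt(exp(1) - 1)

# One simulated data set: does each interval type cover theta?
one_sim = function(n = 50, B = 500, level = 0.95) {
  alpha = 1 - level
  x = exp(rnorm(n))
  est = skew(x)
  reps = replicate(B, skew(sample(x, replace = TRUE)))
  q = unname(quantile(reps, probs = c(alpha / 2, 1 - alpha / 2)))
  ci = rbind(
    normal     = est + c(-1, 1) * qnorm(1 - alpha / 2) * sd(reps),
    pivotal    = c(2 * est - q[2], 2 * est - q[1]),
    percentile = q
  )
  ci[, 1] <= theta & theta <= ci[, 2]
}

set.seed(1)  # arbitrary seed
# 200 repetitions keeps this sketch quick; the exercise asks for at least 1000
covered = replicate(200, one_sim())
coverage = rowMeans(covered)  # proportion of intervals containing theta
coverage
```

Returning `ci[, 2] - ci[, 1]` from `one_sim` as well would give interval lengths in the same pass.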
## Exercise 11 (More Bootstrap Coverage)
Repeat the above exercise, but also report the average length of the three interval types in addition to their coverage. This time use random samples of size $n = 25$ from a $t$ distribution with 3 degrees of freedom. That is,
$$
X_1, \ldots, X_n \sim t_3
$$
Let
$$
\theta = T(F) = (q_{0.75} - q_{0.25}) / 1.34
$$
To obtain a "plug-in" estimate for $q_{p}$ you may simply use the default arguments to R's `quantile()` function.