Confidence Intervals

class: center, middle, inverse, title-slide

.title[
# Confidence Intervals
]
.author[
### Mahendra Mariadassou, INRAE <br> .small[from original slides by Tristan Mary-Huard]
]
.date[
### Shandong University, Weihai (CN)<br>Summer School 2023
]

---

---
class: middle, inverse, center

# Warm-up

## Refresher and technical results

---
## Prerequisites

.pull-left[
.blue[Gaussian variable manipulation]

- Assume `$X_1\sim \N\left(\mu_1,\sigma_1^2\right)$`.

.question[quizz1]

- What is the distribution of `$aX_1+b$` ? 
- By product: what is the distribution of `$\frac{X_1-\mu_1}{\sigma_1}$`

Assume `$X_2\sim \N\left(\mu_2,\sigma_2^2\right)$` and `$X_1\perp X_2$`.

.question[quizz2]
- What is the distribution of `$X_1+X_2$` ?
]

.col-right[
.blue[An important result]

- Assume `$X_1,...,X_n\sim\N(\mu,\sigma^2),$` i.i.d.. Then

`$$\frac{1}{\sigma^2}\sum_{i=1}^n (X_i- \overline{X})^2 \sim \chi^2(n-1)$$`
]

---
## Introducing the Student distribution

Let `$X\sim \N\left(0,1\right)$`,  `$U\sim \chi^2(n)$`, and `$X\perp U$`.

Define  random variable

`$$Z = \frac{X}{\sqrt{U/n}}.$$`
--

Then `$Z$` is said to have a Student distribution with degrees of freedom `$n$`.

One notes
`$$Z \sim \mathcal{T}(n).$$`

One can show that the distribution of `$Z$` is symmetric, _i.e._ its density function satisfies

`$$f(-z)= f(z).$$`

.question[quizz3]

---
## Student density

.question[quizz4]

---
## So far...
.blue[Estimation]

- Assuming `$X_1,...,X_n\sim\L(\theta)$`, i.i.d. one can estimate `$\theta$` through Maximum Likelihood (ML):

`$$\widehat{\theta} = \arg\max_{\theta} Lik_{\theta}(x_1,...,x_n)$$`

- On several examples, the ML estimator `$T$` of parameter `$\theta$` exhibits good properties:
  - low bias, 
  - MSE decreasing with `$n$`.

- Still, if e.g. `$X_1,...,X_n$` are continuous random variables, then
`$$P\left(T= \theta\right) = 0$$`

---
## Confidence interval

Instead of providing a single value, provide a **range** of values such that

$$ P(L \leq \theta \leq U) = 1 - \alpha$$
where

- `$L$` and `$U$` can be computed from the data,
- the **confidence level** `$\alpha$` is **chosen** by the user.

---
## Pivotal statistic

In the previous general framework where
`$$X_1,...,X_n\sim\L(\theta) \text{, i.i.d.}$$`
obtaining a confidence interval will require the following concepts:

.def[Definition]

Let `$T_{pv}$` be a score function (ie a function with values in `$\mathbb{R}$`) that depends on the data and on the unknown parameter `$\theta$`:

`$$T_{pv}=f(X_1,...,X_n,\theta)$$`

If the .blue[distribution] of `$T_{pv}$` is known and does not depend on `$\theta$` (or any other unknown quantity), then `$T_{pv}$` is called a **pivotal statistic**.

.blue[Two questions]
- How to find a pivotal statistic ?
- How to use it to build a confidence interval ?

---
class: middle, inverse, center

# Example 1

## Weighing scale precision

---
## Example 1: Weighing scale precision

A biologist needs a new weighing machine.

A manufacturer proposes a new machine for which the precision is claimed to be of `$10^{-3}\mu g$`.

The biologist performs a trial where the same object of weight 1 `$\mu g$` is weighted 15 times.

The obtained measurements are the following:

```
0.99569 0.99996 1.00011 1.00003 1.00172 
1 0.99672 1.00035 1.00453 0.99869 
1.0032 1.00238 0.99989 1.0008 1.00175 
```

Assuming the scale is unbiased, is the manufacturer honest ?
  - Is the accuracy small enough compared to `$10^{-3}\mu g$`

---
## Modeling

Denote `$W_i$` the measurement obtained for the `$i^{th}$` weighting.

- One assumes that measurements are independent:
`$$W_1\perp W_2\perp... \perp W_{15}$$`

- Measurements are continuous
`$$W_1,..., W_{15}  \sim \N\left(\bullet, \bullet\right), \text{ i.i.d}$$`

- The scale is unbiased
`$$W_1,..., W_{15} \sim \N(1,\bullet), \text{ i.i.d}$$`

- The scale precision is unknown
`$$W_1,..., W_{15} \sim \N(1,\sigma^2), \text{ i.i.d}$$`

.blue[Objective]: propose an interval for the scale precision `$\sigma$`.

---
## Estimation

In most cases `$T_{pv}$` is derived from the ML estimator of the quantity of interest...

`$$\begin{eqnarray*}
Lik_\sigma(w_1,...,w_n) &=& \prod_{i=1}^n f_\sigma(w_i) \quad\text{(i.i.d. assumption)}\\
  &=& \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\frac{(w_i-1)^2}{\sigma^2}\right\} \\
\Rightarrow LLik_\sigma(w_1,...,w_n) &=& -n\log(\sqrt{2\pi}) -n\log(\sigma) -\frac{1}{2\sigma^2}\sum_{i=1}^n (w_i-1)^2
\end{eqnarray*}$$`

.blue[Derivation]

`$$\begin{equation}
\frac{\partial LLik_\sigma(w_1,...,w_n)}{\partial \sigma}=  -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (w_i-1)^2
\end{equation}$$`

Setting the derivative at 0, one gets: 
`$$\widehat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (w_i-1)^2$$`

---
## Pivotal statistic

The ML estimator of `$\sigma^2$` is `$S^2 = \frac{1}{n}\sum_i(W_i-1)^2$`.
What is the distribution of `$S^2$` ? .question[quizz5]

--
`$$\frac{n}{\sigma^2}S^2\sim \chi^2(n)$$`

Let `$T=\frac{n}{\sigma^2}S^2$`. One can observe that

- `$T$` depends on the data,

- `$T$` depends on `$\sigma^2$`,

- `$T$` has a known distribution that does not depend on `$\sigma^2$`,

.blue[Conclusion] `$\Rightarrow T$` is a pivotal statistic `$T\rightarrow T_{pv}$`.

---
## Probability interval

.pull-left[
One seeks values `$a$` and `$b$` such that
$$ P( a \leq T_{pv} \leq b) = 1-\alpha  $$
A possible strategy is to set

- `$a=c_{n,\frac{\alpha}{2}}$`

- `$b=c_{n,1-\frac{\alpha}{2}}$`

where `$c_{n,u}$` is the `$u$` order quantile of the `$\chi^2(n)$` distribution.
]

.pull-right[
![](05_Confidence_Interval_files/figure-html/unnamed-chunk-4-1.png)
]

---
## Probability interval (II)

One then has

`$$\begin{eqnarray*}
&& P\left(c_{n,\frac{\alpha}{2}} \leq \frac{n}{\sigma^2}S^2 \leq c_{n,1-\frac{\alpha}{2}}\right) = 1-\alpha \\
\Rightarrow && P\left( \frac{c_{n,\frac{\alpha}{2}}}{nS^2} \leq \frac{1}{\sigma^2} \leq \frac{c_{n,1-\frac{\alpha}{2}}}{nS^2} \right) = 1-\alpha \\
\Rightarrow && P\left( \frac{nS^2}{c_{n,1-\frac{\alpha}{2}}} \leq \sigma^2 \leq \frac{nS^2}{c_{n,\frac{\alpha}{2}}} \right) = 1-\alpha \\
\Rightarrow && P\left( \sigma \in \left[ \sqrt{\frac{nS^2}{c_{n,1-\frac{\alpha}{2}}}};\ \sqrt{\frac{nS^2}{c_{n,\frac{\alpha}{2}}}} \right] \right) = 1-\alpha
\end{eqnarray*}$$`

---
## Confidence interval

.blue[Application]

One has

- `$n=$` 15
- `$\widehat{\sigma}^2 = \frac{1}{n}\sum_i(w_i-1)^2= 4.95\times 10^{-6}$`

Assume one wants a confidence interval at level 95%, then
- `$\alpha=$` 0.05

- `$c_{15,0.025}=$` 6.26

- `$c_{15,0.975}=$` 27.49

Consequently, one has

`$$\text{IC}_{0.95}(\sigma) = \left[ 0.0016 ;\ 0.0034  \right].$$`

.blue[Conclusion ?]

---
## Some comments

Recall

`$$\begin{equation}
\text{IP}_{1-\alpha}(\sigma) = \left[ S\sqrt{\frac{n}{c_{n,1-\frac{\alpha}{2}}}};\ S \sqrt{\frac{n}{c_{n,\frac{\alpha}{2}}}} \right]
\end{equation}$$`

<img src="05_Confidence_Interval_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
`$n$` fixed at 15 and `$\alpha$` fixed at 0.05

---
class: middle, inverse, center

# Example 2

## Number of bacteriophages

---
## Example 2: Infection

A biologist runs an experiment to investigate the infection of a colony of bacteria by bacteriophages.

A given bacterium can be infected by one or several phages.

The biologist sampled 20 bacteria in the colony and observed the following number of phages per bacteria:

```
4 4 1 3 2 2 3 0 2 3 
2 3 4 1 2 4 5 0 2 2 
```

.blue[Objective]
Provide a confidence interval for the proportion of uninfected bacterias in the colony.

---
## Modeling

Denote `$X_i$` the number of phages obtained for the `$i^{th}$` bacterium.

- One assumes that bacteria are independent:

`$$X_1\perp X_2\perp... \perp X_{n},\quad\text{with } n= 20$$`

- Measurements are discrete

`$$X_1,..., X_{n}  \sim \P (\bullet), \text{ i.i.d}$$`

- The infection level is unknown

`$$X_1,..., X_{n} \sim \P(\lambda), \text{ i.i.d}$$`

.blue[Objective]

Propose a confidence interval for the proportion `$P(X=0)=e^{-\lambda}$`.

---
## Estimation

Starting point: derive the ML estimator for the quantity of interest.

`$$\begin{eqnarray}
Lik_\lambda(x_1,...,x_n) &=& \prod_{i=1}^n P_\lambda(X_i=x_i) \quad\text{(i.i.d. assumption)}\\
  &=& \prod_{i=1}^n \frac{\lambda^{x_i}}{x_i!}e^{-\lambda} \\
\Rightarrow LLik_\lambda(x_1,...,x_n) &=& \log(\lambda)\sum_{i=1}^{n}x_i - \sum_{i=1}^{n}\log(x_i!) - n\lambda
\end{eqnarray}$$`

.blue[Derivation]

`$$\frac{\partial LLik_\lambda(x_1,...,x_n)}{\partial \lambda}=  \frac{1}{\lambda}\sum_{i=1}^n x_i -n$$`

Setting the derivative at 0, one gets: `$\widehat{\lambda} = \frac{1}{n}\sum_{i=1}^n x_i$`.

---
## Pivotal statistic

The ML estimator of `$\lambda$` is `$\overline{X} = \frac{1}{n}\sum_i X_i$`.

What is the distribution of `$\overline{X}$` ?

`$$n\overline{X} \sim \P(n\lambda)$$`

Proof: your turn! (sum of independent Poisson)

One can observe that `$n\overline{X}$` depends on the data, but

- `$n\overline{X}$` does not depend on `$\lambda$`,

- `$n\overline{X}$` has a known distribution but it depends on `$\lambda$`,

`$\Rightarrow n\overline{X}$` is **not** a pivotal statistic.

---
## Gaussian approximation

.pull-left[

.blue[Central limit theorem]

Let `$X_1,...,X_n$` be `$n$` i.i.d. quantitative random variables with mean `$\mu$` and variance `$\sigma^2$`.

Let `$S_n = \sum_i X_i$`.

Then, for `$n$` large, one has

`$$\frac{S_n-n\mu}{\sigma\sqrt{n}} \overset{approx}{\sim} \N(0,1)$$`
]

.pull-right[
.blue[Application]

Back to our infection example, one has

`$$n\overline{X} = \sum_i X_i$$`

where the `$X_i$`'s are i.i.d. with `$E[X_i]=V[X_i]=\lambda$`.

Then

$$ T = \frac{n\overline{X}-n\lambda}{\sqrt{n\lambda}} \overset{approx}{\sim} \N(0,1)$$

(assuming `$n$` is... "big enough"!)

Then `$T$` is a pivotal statistic: `$T\rightarrow T_{pv}$`.
]

---
## Probability interval

.pull-left[
One seeks values `$a$` and `$b$` such that

$$ P( a \leq T_{pv} \leq b) = 1-\alpha  $$
An **optimal** strategy is to set

- `$a=u_{\frac{\alpha}{2}}=-u_{1-\frac{\alpha}{2}}$`

- `$b=u_{1-\frac{\alpha}{2}}$`

where `$u_{q}$` is the `$q$` order quantile of the `$\N(0,1)$` distribution.
]

.pull-right[
<img src="05_Confidence_Interval_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]

---
## Probability interval (II)

`$$\begin{eqnarray}
&& P\left( -u_{1-\frac{\alpha}{2}} \leq \frac{n\overline{X}-n\lambda}{\sqrt{n\lambda}} \leq u_{1-\frac{\alpha}{2}} \right) = 1-\alpha
\end{eqnarray}$$`

To obtain the probability interval, one then needs to "isolate" `$\lambda$`...

---
### First strategy: exact computation

Note that

`$$\begin{eqnarray}
&& P\left( -u_{1-\frac{\alpha}{2}} \leq \frac{n\overline{X}-n\lambda}{\sqrt{n\lambda}} \leq u_{1-\frac{\alpha}{2}} \right) = 1-\alpha \\
\Rightarrow && P\left( \left| \frac{n\overline{X}-n\lambda}{\sqrt{n\lambda}} \right| \leq u_{1-\frac{\alpha}{2}} \right) = 1-\alpha \\
\Rightarrow && P\left( \left( n\overline{X}-n\lambda \right)^2 \leq u_{1-\frac{\alpha}{2}}^2n\lambda \right) = 1-\alpha \\
\Rightarrow && P\left( n^2\lambda^2 - n\left(2n\overline{X} + u_{1-\frac{\alpha}{2}}^2\right)\lambda + n^2\overline{X}^2  \leq 0 \right) = 1-\alpha \\
\end{eqnarray}$$`

To get the lower and upper bounds one needs to find the solutions of the 2nd order equation. One obtains:

`$$P\left( \lambda \in \left[ \overline{X} + \frac{u_{1-\frac{\alpha}{2}}^2}{2n}\quad \pm \quad \frac{nu_{1-\frac{\alpha}{2}}\sqrt{u_{1-\frac{\alpha}{2}}^2 +4n\overline{X}}}{2n^2}       \right] \right) = 1 - \alpha$$`

---
### Second strategy: plug-in approximation

Proceed naively

`$$\begin{eqnarray}
&& P\left( -u_{1-\frac{\alpha}{2}} \leq \frac{n\overline{X}-n\lambda}{\sqrt{n\lambda}} \leq u_{1-\frac{\alpha}{2}} \right) = 1-\alpha \\
\Rightarrow && P\left( -u_{1-\frac{\alpha}{2}}\sqrt{n\lambda} \leq n\overline{X}-n\lambda \leq u_{1-\frac{\alpha}{2}}\sqrt{n\lambda} \right) = 1-\alpha \\
\Rightarrow && P\left( n\overline{X}-u_{1-\frac{\alpha}{2}}\sqrt{n\lambda} \leq n\lambda \leq n\overline{X} + u_{1-\frac{\alpha}{2}}\sqrt{n\lambda} \right) = 1-\alpha \\
\Rightarrow && P\left( \overline{X}-u_{1-\frac{\alpha}{2}}\sqrt{\frac{\lambda}{n}} \leq \lambda \leq \overline{X} + u_{1-\frac{\alpha}{2}}\sqrt{\frac{\lambda}{n}} \right) = 1-\alpha \\
\end{eqnarray}$$`

Now replace `$\lambda$` in the bounds by its estimator...

`$$\begin{eqnarray}
&& P\left( \overline{X}-u_{1-\frac{\alpha}{2}}\sqrt{\frac{\overline{X}}{n}} \leq \lambda \leq \overline{X} + u_{1-\frac{\alpha}{2}}\sqrt{\frac{\overline{X}}{n}} \right) = 1-\alpha \\
\Rightarrow && P\left( \lambda \in\left[ \overline{X}\quad \pm \quad u_{1-\frac{\alpha}{2}}\sqrt{\frac{\overline{X}}{n}}  \right]   \right) = 1-\alpha
\end{eqnarray}$$`

---
## Confidence interval

.blue[Application]
One has

- `$n=$` 20

- `$\widehat{\lambda} = \frac{1}{n}\sum_ix_i= 2.45$`

Assume one wants a confidence interval at level 95%, then

- `$\alpha=$` 0.05

- `$u_{1-\frac{\alpha}{2}}=$` 1.96

Exact strategy : 
`$$\text{IC}_{0.95}(\lambda) = \left[ 1.85 ;\ 3.24  \right]$$`

Plug-in strategy : `$\text{IC}_{0.95}(\lambda) = \left[1.76 ;\ 3.14 \right].$`

.blue[Conclusion ?]

$$ P(X=0) = e^{-\lambda}\Rightarrow \text{IC}_{0.95}(e^{-\lambda}) = \left[ 0.04 ;\ 0.16  \right] \text{ or } \left[ 0.04 ;\ 0.17  \right]$$
}

---
## Summary

.blue[General strategy]

Assume `$X_1,...,X_n\sim \L(\theta)$` i.i.d.

To get a confidence interval for `$\theta$`

- 1/ Find the ML estimator `$T$` of `$\theta$`

- 2/ Find a pivotal statistic `$T_{pv}$` using
  - either the exact distribution of `$T$`,
  - or an approximate (normal) distribution based on CLT.

- 3/ Find a probability interval
  - either by isolating `$\theta$`,
  - or using the plug-in strategy wherever necessary.

- 4/ Choose a confidence level `$1-\alpha$` and compute the CI.

.blue[and don't forget...]

**estimator** `$\neq$` **estimate** and **probability** `$\neq$` **confidence** interval.

---
## Exercise 1: maize yield analysis

Consider here the grain yield measurement obtained from 18 dent lines:

```
16.84 16.03 16.16 14.64 12.55 18 17.01 
15.95 14.84 17.67 16.65 15.87 18.05 16.78 
11.94 14.55 13.68 14.08 
```

`$Hint:$` Quantiles tables are available in the following slides...

1/ Provide an exact probability interval for the yield variance in the dent population, and a confidence interval at level 0.95%.

2/ Provide an exact probability interval for the yield mean in the dent population, and a confidence interval at level 0.95%

---
## Solution

We model the yields `$Y_i$` as i.i.d variables with distribution `$\N(\mu,\sigma^2)$` and `$n = 18$`.

The estimates for the mean is `$\hat{\mu} =$` 15.63 and for the variance is `$\hat{\sigma}^2$` = 3.18.

Remember that

$$
\frac{1}{\sigma^2} \sum_{i=1}^n (Y_i - \bar{Y})^2 = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \quad \text{and} \quad \frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim \mathcal{T}(n-1)
$$
.blue[Probability Intervals]

.pull-left[
`$$P\left( \sigma \in \left[ \sqrt{\frac{(n-1)S^2}{c_{n-1,1-\frac{\alpha}{2}}}};\ \sqrt{\frac{(n-1)S^2}{c_{n-1,\frac{\alpha}{2}}}} \right] \right) = 1-\alpha$$`

`$$\text{IC}_{0.95}(\sigma) = [1.38;  2.75]$$`
]

.col-right[
`$$P\left( \mu \in \left[ \bar{Y} \pm t_{n-1, \alpha/2} \frac{S}{\sqrt{n}} \right] \right) = 1-\alpha$$`
`$$\text{IC}_{0.95}(\mu) = [14.74;  16.51]$$`
]

---
#### Table of the  `$\N(0,1)$` quantiles

---
#### Table of the `$\chi^2\left(k\right)$` quantiles

---
#### Table of the  `$\mathcal{T}(k)$` quantiles