class: center, middle, inverse, title-slide .title[ # Confidence Intervals ] .author[ ### Mahendra Mariadassou, INRAE
.small[from original slides by Tristan Mary-Huard] ] .date[ ### Shandong University, Weihai (CN)
Summer School 2024 ] --- --- class: middle, inverse, center # Warm-up ## Refresher and technical results --- ## Prerequisites .pull-left[ .blue[Gaussian variable manipulation] - Assume `\(X_1\sim \N\left(\mu_1,\sigma_1^2\right)\)`. .question[quizz1] - What is the distribution of `\(aX_1+b\)`? - As a by-product: what is the distribution of `\(\frac{X_1-\mu_1}{\sigma_1}\)`? Assume `\(X_2\sim \N\left(\mu_2,\sigma_2^2\right)\)` and `\(X_1\perp X_2\)`. .question[quizz2] - What is the distribution of `\(X_1+X_2\)`? ] -- .col-right[ .blue[An important result] - Assume `\(X_1,...,X_n\sim\N(\mu,\sigma^2),\)` i.i.d. Then `$$\frac{1}{\sigma^2}\sum_{i=1}^n (X_i- \overline{X})^2 \sim \chi^2(n-1)$$` ] --- ## Introducing the Student distribution Let `\(X\sim \N\left(0,1\right)\)`, `\(U\sim \chi^2(n)\)`, and `\(X\perp U\)`. Define the random variable `$$Z = \frac{X}{\sqrt{U/n}}.$$` -- Then `\(Z\)` is said to have a Student distribution with `\(n\)` degrees of freedom. We write `$$Z \sim \mathcal{T}(n).$$` -- One can show that the distribution of `\(Z\)` is symmetric, _i.e._ its density function satisfies `$$f(-z)= f(z).$$` .question[quizz3] --- ## Student density <img src="05_Confidence_Interval_files/figure-html/density_curves_df-1.png" style="display: block; margin: auto;" /> .question[quizz4] --- ## So far... .blue[Estimation] - Assuming `\(X_1,...,X_n\sim\L(\theta)\)`, i.i.d., one can estimate `\(\theta\)` through Maximum Likelihood (ML): `$$\widehat{\theta} = \arg\max_{\theta} Lik_{\theta}(x_1,...,x_n)$$` -- - On several examples, the ML estimator `\(T\)` of parameter `\(\theta\)` exhibits good properties: - low bias, - MSE decreasing with `\(n\)`. -- - Still, if e.g. 
`\(X_1,...,X_n\)` are continuous random variables, then `$$P\left(T= \theta\right) = 0$$` --- ## Confidence interval Instead of providing a single value, provide a **range** of values such that $$ P(L \leq \theta \leq U) = 1 - \alpha$$ where - `\(L\)` and `\(U\)` can be computed from the data, - the **confidence level** `\(1-\alpha\)` is **chosen** by the user. **Note:** `\(L\)` and `\(U\)` are .blue[random values] depending on `\(X_1, \dots, X_n\)`. --- ## Definition **Random interval** Let `\(L = m(X_1, ..., X_n)\)` and `\(U = M(X_1, ..., X_n)\)` be two r.v. We define a .blue[random interval] `\([L, U]\)` for `\(\theta\)` from the pair `\((L, U)\)` and call `\(P(L < \theta < U)\)` its .blue[level of confidence]. -- **Confidence interval** A .blue[confidence interval] at level `\(1-\alpha\)` for `\(\theta\)` is a realisation `\([l, u]\)` of a random interval with confidence level `\(1-\alpha\)`. -- **Notes:** - In general, we want the interval to be as .blue[narrow] as possible. - `\(\theta\)` is a fixed value and `\([L, U]\)` is random. For any realisation `\([l, u]\)` of `\([L, U]\)`, `\(\theta\)` is .blue[either] in the interval or not. --- ## Pivotal statistic In the previous general framework where `$$X_1,...,X_n\sim\L(\theta) \text{, i.i.d.}$$` obtaining a confidence interval will require the following concepts: -- .def[Definition] Let `\(T_{pv}\)` be a score function (i.e. a function with values in `\(\mathbb{R}\)`) that depends on the data and on the unknown parameter `\(\theta\)`: `$$T_{pv}=f(X_1,...,X_n,\theta)$$` If the .blue[distribution] of `\(T_{pv}\)` is known and does not depend on `\(\theta\)` (or any other unknown quantity), then `\(T_{pv}\)` is called a **pivotal statistic**. -- .blue[Two questions] - How to find a pivotal statistic? - How to use it to build a confidence interval? 
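---

## Pivotal statistic: a simulation sketch

The definition above can be illustrated by simulation. The sketch below assumes a Gaussian sample with unit variance and uses the candidate pivot `\(T = \sqrt{n}(\overline{X} - \mu)\)`; the values of `\(\mu\)`, `\(n\)`, the number of replicates and the helper names are illustrative choices, not part of the course material. If `\(T\)` is pivotal, the share of replicates with `\(|T| \leq u_{0.975}\)` should be close to 0.95 whatever the value of `\(\mu\)`.

```python
import random
from statistics import NormalDist, fmean


def pivot_sample(mu, n, rng):
    """Draw one value of T = sqrt(n) * (mean(X) - mu) with X_i ~ N(mu, 1)."""
    xs = [rng.gauss(mu, 1.0) for _ in range(n)]
    return (fmean(xs) - mu) * n ** 0.5


def share_within(mu, n=10, reps=20000, seed=1):
    """Fraction of replicates for which |T| <= 0.975 quantile of N(0, 1)."""
    rng = random.Random(seed)
    u = NormalDist().inv_cdf(0.975)
    hits = sum(abs(pivot_sample(mu, n, rng)) <= u for _ in range(reps))
    return hits / reps


# Two hypothetical values of mu: the share should be about the same for both.
for mu in (0.0, 5.0):
    print(f"mu = {mu}: share of |T| <= u = {share_within(mu):.3f}")
```

By contrast, the distribution of `\(\overline{X}\)` itself shifts with `\(\mu\)`, which is why it cannot be used directly as a pivot.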
--- class: middle, inverse, center # Example 0 ## Mean of a Gaussian with unit variance --- ## Mean of a Gaussian with unit variance - **Model** `\(X_1, ..., X_n\)` Gaussian with parameter `\(\mu\)` (unknown) and `\(\sigma^2 = 1\)` (known) - **Estimator** `$$\bar{X} = \sum X_i /n$$` - We assume for simplicity that `\(V(X_i) = \sigma^2 = 1\)` is known - Model for the estimator [using properties of Gaussian variables / Central Limit Theorem] $$ \bar{X} \sim \mathcal{N}(\mu, 1/n) $$ Or equivalently $$ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim\mathcal{N}(0, 1) $$ -- We can use .blue[quantiles] of the standard Gaussian to find bounds. --- ## Confidence Interval .pull-left[ With probability 0.95 $$ `\begin{equation} q_{0.025} \leq \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \leq q_{0.975} \end{equation}` $$ Or with a bit of manipulation $$ `\begin{align} \mu & \in \left[ \bar{X} - q_{0.975}\frac{\sigma}{\sqrt{n}}, \bar{X} - q_{0.025}\frac{\sigma}{\sqrt{n}} \right] \\ \mu & \in \bar{X} \pm 1.96 \frac{\sigma}{\sqrt{n}} \end{align}` $$ ] .pull-right[ <img src="05_Confidence_Interval_files/figure-html/unnamed-chunk-1-1.png" width="500px" /> ] --- ## A cautionary tale - .blue[On average] 95% of the intervals contain the true value. - But for a .blue[single] interval, the true value is .blue[either] in the interval or not in the interval <img src="05_Confidence_Interval_files/figure-html/unnamed-chunk-2-1.png" width="1000px" /> --- class: middle, inverse, center # Example 1 ## Weighing scale precision --- ## Example 1: Weighing scale precision A biologist needs a new weighing machine. -- A manufacturer proposes a new machine for which the precision is claimed to be `\(10^{-3}\mu g\)`. -- The biologist performs a trial where the same object of weight 1 `\(\mu g\)` is weighed 15 times. 
-- The obtained measurements are the following: ``` 1.00138 1.00172 1.00163 0.99721 0.99802 1.00039 1.0023 1.00004 0.99884 1.00156 0.99943 1.00121 0.99826 1.00354 0.99925 ``` -- Assuming the scale is unbiased, is the manufacturer honest? - Is the observed precision compatible with `\(10^{-3}\mu g\)`? --- ## Modeling Denote `\(W_i\)` the measurement obtained for the `\(i^{th}\)` weighing. - One assumes that measurements are independent: `$$W_1\perp W_2\perp... \perp W_{15}$$` - Measurements are continuous `$$W_1,..., W_{15} \sim \N\left(\bullet, \bullet\right), \text{ i.i.d}$$` - The scale is unbiased `$$W_1,..., W_{15} \sim \N(1,\bullet), \text{ i.i.d}$$` - The scale precision is unknown `$$W_1,..., W_{15} \sim \N(1,\sigma^2), \text{ i.i.d}$$` -- .blue[Objective]: propose an interval for the scale precision `\(\sigma\)`. --- ## Estimation In most cases `\(T_{pv}\)` is derived from the ML estimator of the quantity of interest... `$$\begin{eqnarray*} Lik_\sigma(w_1,...,w_n) &=& \prod_{i=1}^n f_\sigma(w_i) \quad\text{(i.i.d. assumption)}\\ &=& \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\frac{(w_i-1)^2}{\sigma^2}\right\} \\ \Rightarrow LLik_\sigma(w_1,...,w_n) &=& -n\log(\sqrt{2\pi}) -n\log(\sigma) -\frac{1}{2\sigma^2}\sum_{i=1}^n (w_i-1)^2 \end{eqnarray*}$$` -- .blue[Differentiation] `$$\begin{equation} \frac{\partial LLik_\sigma(w_1,...,w_n)}{\partial \sigma}= -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (w_i-1)^2 \end{equation}$$` Setting the derivative to 0, one gets: `$$\widehat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (w_i-1)^2$$` --- ## Pivotal statistic The ML estimator of `\(\sigma^2\)` is `\(S^2 = \frac{1}{n}\sum_i(W_i-1)^2\)`. What is the distribution of `\(S^2\)`? .question[quizz5] -- `$$\frac{n}{\sigma^2}S^2\sim \chi^2(n)$$` -- Let `\(T=\frac{n}{\sigma^2}S^2\)`. 
One can observe that - `\(T\)` depends on the data, - `\(T\)` depends on `\(\sigma^2\)`, - `\(T\)` has a known distribution that does not depend on `\(\sigma^2\)`, .blue[Conclusion] `\(\Rightarrow T\)` is a pivotal statistic `\(T\rightarrow T_{pv}\)`. --- ## Probability interval .pull-left[ One seeks values `\(a\)` and `\(b\)` such that $$ P( a \leq T_{pv} \leq b) = 1-\alpha $$ A possible strategy is to set - `\(a=c_{n,\frac{\alpha}{2}}\)` - `\(b=c_{n,1-\frac{\alpha}{2}}\)` where `\(c_{n,u}\)` is the `\(u\)` order quantile of the `\(\chi^2(n)\)` distribution. ] -- .pull-right[ ![](05_Confidence_Interval_files/figure-html/unnamed-chunk-6-1.png)<!-- --> ] --- ## Probability interval (II) One then has -- `$$\begin{eqnarray*} && P\left(c_{n,\frac{\alpha}{2}} \leq \frac{n}{\sigma^2}S^2 \leq c_{n,1-\frac{\alpha}{2}}\right) = 1-\alpha \\ \Rightarrow && P\left( \frac{c_{n,\frac{\alpha}{2}}}{nS^2} \leq \frac{1}{\sigma^2} \leq \frac{c_{n,1-\frac{\alpha}{2}}}{nS^2} \right) = 1-\alpha \\ \Rightarrow && P\left( \frac{nS^2}{c_{n,1-\frac{\alpha}{2}}} \leq \sigma^2 \leq \frac{nS^2}{c_{n,\frac{\alpha}{2}}} \right) = 1-\alpha \\ \Rightarrow && P\left( \sigma \in \left[ \sqrt{\frac{nS^2}{c_{n,1-\frac{\alpha}{2}}}};\ \sqrt{\frac{nS^2}{c_{n,\frac{\alpha}{2}}}} \right] \right) = 1-\alpha \end{eqnarray*}$$` --- ## Confidence interval .blue[Application] One has - `\(n=\)` 15 - `\(\widehat{\sigma}^2 = \frac{1}{n}\sum_i(w_i-1)^2= 3.09\times 10^{-6}\)` -- Assume one wants a confidence interval at level 95%, then - `\(\alpha=\)` 0.05 - `\(c_{15,0.025}=\)` 6.26 - `\(c_{15,0.975}=\)` 27.49 Consequently, one has `$$\text{IC}_{0.95}(\sigma) = \left[ 0.0013 ;\ 0.0027 \right].$$` .blue[Conclusion ?] 
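---

## Confidence interval: numerical check

The application on the previous slide can be reproduced with a few lines of code. This is a minimal sketch in Python (standard library only); the variable names are illustrative, and the two `\(\chi^2(15)\)` quantiles are simply copied from the slide rather than computed.

```python
from math import sqrt

# Measurements from the trial (known true weight: 1 microgram).
weights = [1.00138, 1.00172, 1.00163, 0.99721, 0.99802,
           1.00039, 1.0023, 1.00004, 0.99884, 1.00156,
           0.99943, 1.00121, 0.99826, 1.00354, 0.99925]

n = len(weights)

# ML estimate of sigma^2 when the mean is known and equal to 1.
sigma2_hat = sum((w - 1.0) ** 2 for w in weights) / n

# chi2(15) quantiles of order 0.025 and 0.975, as quoted on the slide.
c_low, c_high = 6.26, 27.49

# Bounds of the interval for sigma: sqrt(n * S^2 / quantile).
lower = sqrt(n * sigma2_hat / c_high)
upper = sqrt(n * sigma2_hat / c_low)

print(f"sigma2_hat = {sigma2_hat:.3e}")
print(f"CI_0.95(sigma) = [{lower:.4f}; {upper:.4f}]")
```

Up to rounding, this should give back the interval of the previous slide, to be compared with the claimed precision of `\(10^{-3}\mu g\)`.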
--- ## Some comments Recall `$$\begin{equation} \text{IP}_{1-\alpha}(\sigma) = \left[ S\sqrt{\frac{n}{c_{n,1-\frac{\alpha}{2}}}};\ S \sqrt{\frac{n}{c_{n,\frac{\alpha}{2}}}} \right] \end{equation}$$` -- <img src="05_Confidence_Interval_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> `\(n\)` fixed at 15 and `\(\alpha\)` fixed at 0.05 --- class: middle, inverse, center # Example 2 ## Number of bacteriophages --- ## Example 2: Infection A biologist runs an experiment to investigate the infection of a colony of bacteria by bacteriophages. -- A given bacterium can be infected by one or several phages. -- The biologist sampled 20 bacteria in the colony and observed the following number of phages per bacterium: ``` 4 4 1 3 2 2 3 0 2 3 2 3 4 1 2 4 5 0 2 2 ``` -- .blue[Objective] Provide a confidence interval for the proportion of uninfected bacteria in the colony. --- ## Modeling Denote `\(X_i\)` the number of phages obtained for the `\(i^{th}\)` bacterium. - One assumes that bacteria are independent: `$$X_1\perp X_2\perp... \perp X_{n},\quad\text{with } n= 20$$` - Measurements are discrete `$$X_1,..., X_{n} \sim \P (\bullet), \text{ i.i.d}$$` - The infection level is unknown `$$X_1,..., X_{n} \sim \P(\lambda), \text{ i.i.d}$$` .blue[Objective] Propose a confidence interval for the proportion `\(P(X=0)=e^{-\lambda}\)`. --- ## Estimation Starting point: derive the ML estimator for the quantity of interest. `$$\begin{eqnarray} Lik_\lambda(x_1,...,x_n) &=& \prod_{i=1}^n P_\lambda(X_i=x_i) \quad\text{(i.i.d. assumption)}\\ &=& \prod_{i=1}^n \frac{\lambda^{x_i}}{x_i!}e^{-\lambda} \\ \Rightarrow LLik_\lambda(x_1,...,x_n) &=& \log(\lambda)\sum_{i=1}^{n}x_i - \sum_{i=1}^{n}\log(x_i!) - n\lambda \end{eqnarray}$$` .blue[Differentiation] `$$\frac{\partial LLik_\lambda(x_1,...,x_n)}{\partial \lambda}= \frac{1}{\lambda}\sum_{i=1}^n x_i -n$$` Setting the derivative to 0, one gets: `\(\widehat{\lambda} = \frac{1}{n}\sum_{i=1}^n x_i\)`. 
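---

## Estimation: numerical check

The closed-form estimator can be cross-checked numerically. The sketch below evaluates the Poisson log-likelihood on a grid of candidate values for `\(\lambda\)` and compares the maximiser with the sample mean; the grid (0.01 to 10 in steps of 0.01) and the function names are arbitrary illustrative choices.

```python
from math import lgamma, log

# Observed number of phages per bacterium.
counts = [4, 4, 1, 3, 2, 2, 3, 0, 2, 3, 2, 3, 4, 1, 2, 4, 5, 0, 2, 2]


def log_lik(lam, xs):
    """Poisson log-likelihood; lgamma(x + 1) equals log(x!)."""
    return sum(x * log(lam) - lgamma(x + 1) - lam for x in xs)


# Closed-form ML estimate: the sample mean.
lam_hat = sum(counts) / len(counts)

# Brute-force maximisation over a grid of candidate values.
grid = [k / 100 for k in range(1, 1001)]
lam_grid = max(grid, key=lambda lam: log_lik(lam, counts))

print(f"sample mean = {lam_hat}, grid maximiser = {lam_grid}")
```

Both values should agree up to the grid resolution.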
--- ## Pivotal statistic The ML estimator of `\(\lambda\)` is `\(\overline{X} = \frac{1}{n}\sum_i X_i\)`. What is the distribution of `\(\overline{X}\)` ? `$$n\overline{X} \sim \P(n\lambda)$$` Proof: your turn! (sum of independent Poisson) -- One can observe that `\(n\overline{X}\)` depends on the data, but - `\(n\overline{X}\)` does not depend on `\(\lambda\)`, - `\(n\overline{X}\)` has a known distribution but it depends on `\(\lambda\)`, `\(\Rightarrow n\overline{X}\)` is **not** a pivotal statistic. --- ## Gaussian approximation .pull-left[ .blue[Central limit theorem] Let `\(X_1,...,X_n\)` be `\(n\)` i.i.d. quantitative random variables with mean `\(\mu\)` and variance `\(\sigma^2\)`. Let `\(S_n = \sum_i X_i\)`. Then, for `\(n\)` large, one has `$$\frac{S_n-n\mu}{\sigma\sqrt{n}} \overset{approx}{\sim} \N(0,1)$$` ] -- .pull-right[ .blue[Application] Back to our infection example, one has `$$n\overline{X} = \sum_i X_i$$` where the `\(X_i\)`'s are i.i.d. with `\(E[X_i]=V[X_i]=\lambda\)`. Then $$ T = \frac{n\overline{X}-n\lambda}{\sqrt{n\lambda}} \overset{approx}{\sim} \N(0,1)$$ (assuming `\(n\)` is... "big enough"!) Then `\(T\)` is a pivotal statistic: `\(T\rightarrow T_{pv}\)`. ] --- ## Probability interval .pull-left[ One seeks values `\(a\)` and `\(b\)` such that $$ P( a \leq T_{pv} \leq b) = 1-\alpha $$ An **optimal** strategy is to set - `\(a=u_{\frac{\alpha}{2}}=-u_{1-\frac{\alpha}{2}}\)` - `\(b=u_{1-\frac{\alpha}{2}}\)` where `\(u_{q}\)` is the `\(q\)` order quantile of the `\(\N(0,1)\)` distribution. ] -- .pull-right[ <img src="05_Confidence_Interval_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] --- ## Probability interval (II) `$$\begin{eqnarray} && P\left( -u_{1-\frac{\alpha}{2}} \leq \frac{n\overline{X}-n\lambda}{\sqrt{n\lambda}} \leq u_{1-\frac{\alpha}{2}} \right) = 1-\alpha \end{eqnarray}$$` To obtain the probability interval, one then needs to "isolate" `\(\lambda\)`... 
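---

## Probability interval: checking the approximation

How good is the Gaussian approximation for `\(n = 20\)`? One way to get a feel for it is to simulate the statistic `\(T\)` and count how often it falls between `\(-u_{1-\frac{\alpha}{2}}\)` and `\(u_{1-\frac{\alpha}{2}}\)`. The sketch below assumes a hypothetical true value `\(\lambda = 2.45\)`, chosen for illustration only, and draws Poisson variables with Knuth's multiplication method since the Python standard library has no Poisson generator; the function names and the number of replicates are illustrative as well.

```python
import random
from math import exp, sqrt
from statistics import NormalDist


def poisson_draw(lam, rng):
    """One Poisson(lam) draw using Knuth's multiplication method."""
    threshold = exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def coverage(lam, n, reps=5000, alpha=0.05, seed=1):
    """Estimate P(|T| <= u) with T = (n*Xbar - n*lam) / sqrt(n*lam)."""
    rng = random.Random(seed)
    u = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(reps):
        total = sum(poisson_draw(lam, rng) for _ in range(n))  # n * Xbar
        t = (total - n * lam) / sqrt(n * lam)
        hits += abs(t) <= u
    return hits / reps


# Hypothetical true value of lambda, with the sample size of the example.
print(f"estimated probability: {coverage(lam=2.45, n=20):.3f}")
```

An estimate close to 0.95 suggests that `\(n\)` is "big enough" for this value of `\(\lambda\)`; smaller `\(n\)` or smaller `\(\lambda\)` may behave differently.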
--- ### First strategy: exact computation Note that `$$\begin{eqnarray} && P\left( -u_{1-\frac{\alpha}{2}} \leq \frac{n\overline{X}-n\lambda}{\sqrt{n\lambda}} \leq u_{1-\frac{\alpha}{2}} \right) = 1-\alpha \\ \Rightarrow && P\left( \left| \frac{n\overline{X}-n\lambda}{\sqrt{n\lambda}} \right| \leq u_{1-\frac{\alpha}{2}} \right) = 1-\alpha \\ \Rightarrow && P\left( \left( n\overline{X}-n\lambda \right)^2 \leq u_{1-\frac{\alpha}{2}}^2n\lambda \right) = 1-\alpha \\ \Rightarrow && P\left( n^2\lambda^2 - n\left(2n\overline{X} + u_{1-\frac{\alpha}{2}}^2\right)\lambda + n^2\overline{X}^2 \leq 0 \right) = 1-\alpha \\ \end{eqnarray}$$` -- To get the lower and upper bounds one needs to find the solutions of the 2nd order equation. One obtains: `$$P\left( \lambda \in \left[ \overline{X} + \frac{u_{1-\frac{\alpha}{2}}^2}{2n}\quad \pm \quad \frac{nu_{1-\frac{\alpha}{2}}\sqrt{u_{1-\frac{\alpha}{2}}^2 +4n\overline{X}}}{2n^2} \right] \right) = 1 - \alpha$$` --- ### Second strategy: plug-in approximation Proceed naively `$$\begin{eqnarray} && P\left( -u_{1-\frac{\alpha}{2}} \leq \frac{n\overline{X}-n\lambda}{\sqrt{n\lambda}} \leq u_{1-\frac{\alpha}{2}} \right) = 1-\alpha \\ \Rightarrow && P\left( -u_{1-\frac{\alpha}{2}}\sqrt{n\lambda} \leq n\overline{X}-n\lambda \leq u_{1-\frac{\alpha}{2}}\sqrt{n\lambda} \right) = 1-\alpha \\ \Rightarrow && P\left( n\overline{X}-u_{1-\frac{\alpha}{2}}\sqrt{n\lambda} \leq n\lambda \leq n\overline{X} + u_{1-\frac{\alpha}{2}}\sqrt{n\lambda} \right) = 1-\alpha \\ \Rightarrow && P\left( \overline{X}-u_{1-\frac{\alpha}{2}}\sqrt{\frac{\lambda}{n}} \leq \lambda \leq \overline{X} + u_{1-\frac{\alpha}{2}}\sqrt{\frac{\lambda}{n}} \right) = 1-\alpha \\ \end{eqnarray}$$` -- Now replace `\(\lambda\)` in the bounds by its estimator... 
`$$\begin{eqnarray} && P\left( \overline{X}-u_{1-\frac{\alpha}{2}}\sqrt{\frac{\overline{X}}{n}} \leq \lambda \leq \overline{X} + u_{1-\frac{\alpha}{2}}\sqrt{\frac{\overline{X}}{n}} \right) = 1-\alpha \\ \Rightarrow && P\left( \lambda \in\left[ \overline{X}\quad \pm \quad u_{1-\frac{\alpha}{2}}\sqrt{\frac{\overline{X}}{n}} \right] \right) = 1-\alpha \end{eqnarray}$$` --- ## Confidence interval .blue[Application] One has - `\(n=\)` 20 - `\(\widehat{\lambda} = \frac{1}{n}\sum_ix_i= 2.45\)` Assume one wants a confidence interval at level 95%, then - `\(\alpha=\)` 0.05 - `\(u_{1-\frac{\alpha}{2}}=\)` 1.96 Exact strategy: `$$\text{IC}_{0.95}(\lambda) = \left[ 1.85 ;\ 3.24 \right]$$` Plug-in strategy: `\(\text{IC}_{0.95}(\lambda) = \left[1.76 ;\ 3.14 \right].\)` .blue[Conclusion ?] -- $$ P(X=0) = e^{-\lambda}\Rightarrow \text{IC}_{0.95}(e^{-\lambda}) = \left[ 0.04 ;\ 0.16 \right] \text{ or } \left[ 0.04 ;\ 0.17 \right]$$ --- ## Summary .blue[General strategy] Assume `\(X_1,...,X_n\sim \L(\theta)\)` i.i.d. To get a confidence interval for `\(\theta\)` - 1/ Find the ML estimator `\(T\)` of `\(\theta\)` - 2/ Find a pivotal statistic `\(T_{pv}\)` using - either the exact distribution of `\(T\)`, - or an approximate (normal) distribution based on CLT. - 3/ Find a probability interval - either by isolating `\(\theta\)`, - or using the plug-in strategy wherever necessary. - 4/ Choose a confidence level `\(1-\alpha\)` and compute the CI. -- .blue[and don't forget...] **estimator** `\(\neq\)` **estimate** and **probability** `\(\neq\)` **confidence** interval. --- ## Exercise 1: maize yield analysis Consider here the grain yield measurements obtained from 18 dent lines: ``` 16.84 16.03 16.16 14.64 12.55 18 17.01 15.95 14.84 17.67 16.65 15.87 18.05 16.78 11.94 14.55 13.68 14.08 ``` _Hint:_ Quantile tables are available in the following slides... 
-- 1/ Provide an exact probability interval for the yield variance in the dent population, and a confidence interval at level 95%. 2/ Provide an exact probability interval for the yield mean in the dent population, and a confidence interval at level 95%. --- ## Solution -- We model the yields `\(Y_i\)` as i.i.d. variables with distribution `\(\N(\mu,\sigma^2)\)` and `\(n = 18\)`. The estimate of the mean is `\(\hat{\mu} =\)` 15.63 and the estimate of the variance is `\(\hat{\sigma}^2 = S^2 =\)` 3.18. Remember that $$ \frac{1}{\sigma^2} \sum_{i=1}^n (Y_i - \bar{Y})^2 = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \quad \text{and} \quad \frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim \mathcal{T}(n-1) $$ .blue[Probability Intervals] .pull-left[ `$$P\left( \sigma \in \left[ \sqrt{\frac{(n-1)S^2}{c_{n-1,1-\frac{\alpha}{2}}}};\ \sqrt{\frac{(n-1)S^2}{c_{n-1,\frac{\alpha}{2}}}} \right] \right) = 1-\alpha$$` `$$\text{IC}_{0.95}(\sigma) = [1.34; 2.67]$$` ] .col-right[ `$$P\left( \mu \in \left[ \bar{Y} \pm t_{n-1, 1-\frac{\alpha}{2}} \frac{S}{\sqrt{n}} \right] \right) = 1-\alpha$$` `$$\text{IC}_{0.95}(\mu) = [14.74; 16.51]$$` ] --- #### Table of the `\(\N(0,1)\)` quantiles <img src="Figures/QuantilesNormaleCentreeReduite.png" width="587" style="display: block; margin: auto;" /> --- #### Table of the `\(\chi^2\left(k\right)\)` quantiles <img src="Figures/TableQuantilesKhi2.png" width="886" style="display: block; margin: auto;" /> --- #### Table of the `\(\mathcal{T}(k)\)` quantiles <img src="Figures/TableQuantilesStudent.png" width="863" style="display: block; margin: auto;" />
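
---

#### Using the `\(\mathcal{T}(k)\)` table: interval for the mean in Exercise 1

As an illustration of how a tabulated quantile is used, the sketch below recomputes the confidence interval for the mean yield in Exercise 1. The Student quantile is typed in by hand (2.110, assumed here to be the order 0.975 quantile of `\(\mathcal{T}(17)\)` as read from a standard table); the variable names are illustrative.

```python
from math import sqrt
from statistics import fmean, stdev

# Grain yields of the 18 dent lines.
yields = [16.84, 16.03, 16.16, 14.64, 12.55, 18, 17.01, 15.95, 14.84,
          17.67, 16.65, 15.87, 18.05, 16.78, 11.94, 14.55, 13.68, 14.08]

n = len(yields)
y_bar = fmean(yields)
s = stdev(yields)  # square root of the unbiased variance estimate S^2

# Assumed table value: 0.975 quantile of the Student distribution T(17).
t_quantile = 2.110

half_width = t_quantile * s / sqrt(n)
ci_mu = (y_bar - half_width, y_bar + half_width)

print(f"mean = {y_bar:.2f}, S^2 = {s ** 2:.2f}")
print(f"CI_0.95(mu) = [{ci_mu[0]:.2f}; {ci_mu[1]:.2f}]")
```

The same pattern applies to the interval for `\(\sigma\)`, with the two `\(\chi^2(17)\)` quantiles read from the corresponding table.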