class: center, middle, inverse, title-slide

.title[
# Bivariate Analysis
]
.author[
### Mahendra Mariadassou, INRAE
.small[from original slides by Tristan Mary-Huard]
]
.date[
### Shandong University, Weihai (CN)
Summer School 2023
]

---

## The AgroParisTech dataset

.question[quizz0]

---
class: middle, inverse, center

# Motivating example

---

## The AgroParisTech dataset

- 731 individuals
- 2 **qualitative** variables: Gender, Year
- 3 **quantitative** variables: Height, Weight, Age

--

How can we investigate the .blue[joint] distribution of 2 descriptors in a population?

--

.blue[3 kinds of joint analysis]:

.pull-left[
- qualitative - qualitative
- quantitative - qualitative
- quantitative - quantitative
]

.pull-right[
- Ex: Gender and Year
- Ex: Height and Gender
- Ex: Height and Weight
]

---
class: middle, inverse, center

# Qualitative - Qualitative

---

## Qualitative - Qualitative

The pair (`Gender`,`Year`) is directly described through its joint distribution

`$$P\left(G=g \cap Y=y \right) = \frac{n_{gy}}{n} \ \ .$$`

The contingency table displays exhaustive information:

.question[Quizz1]

.pull-left[
```
 Year      F     M  Total
   1A    251   108    359
   2A    119    36    155
   3A    145    72    217
Total    515   216    731
```
]

--

.col-right[
```
 Year      F      M  Total
   1A  0.343  0.148  0.491
   2A  0.163  0.049  0.212
   3A  0.198  0.098  0.297
Total  0.705  0.295  1.000
```
]

--

.pull-left[
```
 Year      F      M  Total
   1A  0.699  0.301      1
   2A  0.768  0.232      1
   3A  0.668  0.332      1
```
]

--

.col-right[
```
 Year      F      M
   1A  0.487  0.500
   2A  0.231  0.167
   3A  0.282  0.333
Total  1.000  1.000
```
]

---

#### Graphical representations

.pull-left[
- All counts

![](02_BivariateAnalysis_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

--

.col-right[
- Percentages proportional to area

![](02_BivariateAnalysis_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]

--

.pull-left[
- Conditional probabilities (by Gender)

![](02_BivariateAnalysis_files/figure-html/unnamed-chunk-8-1.png)<!-- -->
]

--

.col-right[
- Conditional probabilities (by Year)

![](02_BivariateAnalysis_files/figure-html/unnamed-chunk-9-1.png)<!-- -->
]

---
class: middle, inverse, center

# Quantitative - Qualitative

---

## Quantitative - Qualitative

Each level of variable `Gender` defines a sub-population, in which variable `Height` can be described.

.pull-left[
<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-11-1.png" height="400px" />
]

.pull-right[
<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-12-1.png" height="400px" />
]

- Same graphical tool as for 1 population, but...

---

## Quantitative - Qualitative

Each level of variable `Gender` defines a sub-population, in which variable `Height` can be described.

.pull-left[
<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-13-1.png" height="400px" />
]

.pull-right[
<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-14-1.png" height="400px" />
]

- Same graphical tool as for 1 population, but .blue[pay attention] to scaling effects...

---

## Quantitative - Qualitative

Each level of variable `Gender` defines a sub-population, in which variable `Height` can be described.

.pull-left[
<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-15-1.png" height="400px" />
]

.pull-right[
<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-16-1.png" height="400px" />
]

- Same graphical tool as for 1 population, but .blue[pay attention] to scaling effects, or .blue[avoid] them

---
class: middle, inverse, center

# Quantitative - Quantitative

---

## Quantitative - Quantitative

As in the univariate case, when dealing with continuous variables the joint distribution cannot be explored exhaustively.

--

**Nevertheless**, graphical representations can be produced:

<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

--

The relationship between Height and Weight looks quite linear.

.blue[Question]: How can the linearity of the relationship be quantified?

---

## Covariance

.blue[Definition:] The covariance `\(\sigma_{X,Y}\)` between two quantitative variables `\(X\)` and `\(Y\)` is

`$$\sigma_{X,Y} = \sum_{i}\sum_{j}\left(x_i-E(X)\right)\left(y_j-E(Y)\right)P\left(X=x_i,Y=y_j\right)$$`

for quantitative discrete variables, and

`$$\sigma_{X,Y} = \int_{x}\int_{y}\left(x-E(X)\right)\left(y-E(Y)\right)\times f_{X,Y}(x,y)dxdy$$`

for continuous variables.

.question[Quizz2]

--

.blue[Examples:]

- Covariance between Height and Weight: `\(\sigma_{H,W} = 45.9099597\)`
- Covariance between Height (in cm) and Weight: `\(\sigma_{H,W} = 4590.995969\)`
- Covariance between Height and Age: `\(\sigma_{H,A} = 0.2793759\)`

--

.blue[Conclusion:] Covariance depends on the units of measurement, which makes it difficult to interpret.

---

## Correlation

.blue[Definition:] The correlation `\(\rho_{X,Y}\)` between two quantitative variables `\(X\)` and `\(Y\)` is

.question[Quizz3]

`$$\rho_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \times \sigma_Y} \ \ .$$`

--

Dividing by the standard deviations `\(\Rightarrow\)` removes the scaling effect.

--

.def[Property:] `\(\rho_{X,Y} \in [-1,\ 1]\)`

- `\(\rho_{X,Y} \approx 1\)` `\(\Rightarrow\)` positive linear relationship between `\(X\)` and `\(Y\)`,
- `\(\rho_{X,Y} \approx -1\)` `\(\Rightarrow\)` negative linear relationship between `\(X\)` and `\(Y\)`,
- `\(\rho_{X,Y} \approx 0\)` `\(\Rightarrow\)` no linear relationship between `\(X\)` and `\(Y\)`.

--

.blue[Examples:]

- Correlation between Height and Weight: `\(\rho_{H,W} = 0.7095302\)`
- Correlation between Height (in cm) and Weight: `\(\rho_{H,W} = 0.7095302\)`
- Correlation between Height and Age: `\(\rho_{H,A} = 0.0197248\)`

---

## Intuition on the covariance (I)

.question[quizz4]

<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Intuition on the covariance (II)

.alert[All] datasets have the same summary statistics (and the same correlation `\(\rho = -0.06\)`)!

<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

## About interpretation...

<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

.blue[Conclusion:]

- Correlation does not imply causation,
- Correlation .alert[does not replace] graphical representation.

---

## About the effect of the outlier

<img src="02_BivariateAnalysis_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

## Exercise

An economist investigates the relationship between education level and income in a small firm with 5 employees:

```
# A tibble: 5 × 3
  Name        Diploma Income
  <chr>         <dbl>  <dbl>
1 Engineer          5     36
2 CAP              -2     14
3 DUT               2     21
4 High School       0     16
5 Msc               5     30
```

Compute the mean and variance of each variable, and the covariance and correlation between education level and income.

--

.blue[Answer] (standard deviations in parentheses)

`\(\widehat{\mu}_I = 23.4, \ \widehat{\mu}_D = 2,\ \widehat{\sigma}^2_I = 70.24 \ (8.4), \ \widehat{\sigma}^2_D = 7.6\ (2.76)\)`

`\(\widehat{\sigma}_{I,D} = 22, \ \widehat{\rho}_{I,D} = 0.95\)`
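
The answer can be checked numerically. The sketch below is not part of the original slides: it uses plain Python, and the helper names `mean`, `covariance` and `correlation` are hypothetical, chosen for illustration. It assumes the population convention (dividing by `\(n\)`), which is the one that matches the values in the answer.

```python
from math import sqrt

# Data copied from the exercise table.
diploma = [5, -2, 2, 0, 5]
income = [36, 14, 21, 16, 30]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: mean product of deviations (divides by n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


mu_i, mu_d = mean(income), mean(diploma)
# The variance of a variable is its covariance with itself.
var_i, var_d = covariance(income, income), covariance(diploma, diploma)
cov_id = covariance(income, diploma)
rho_id = correlation(income, diploma)

# Rescaling one variable (here: income multiplied by 100) rescales the
# covariance by the same factor but leaves the correlation unchanged.
income_scaled = [100 * v for v in income]
cov_scaled = covariance(income_scaled, diploma)
rho_scaled = correlation(income_scaled, diploma)

print(f"means:       income={mu_i:.2f}, diploma={mu_d:.2f}")
print(f"variances:   income={var_i:.2f}, diploma={var_d:.2f}")
print(f"covariance:  {cov_id:.2f} (income x 100: {cov_scaled:.2f})")
print(f"correlation: {rho_id:.3f} (income x 100: {rho_scaled:.3f})")
```

Dividing by `\(n - 1\)` instead (the convention used by R's `var` and `cov`) gives larger variances and covariance, but the same correlation, since the factor cancels in the ratio.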