Exercise 1

The idea was to have you clean up a dataset yourself before running a PCA. A direct PCA is not possible here: too many rows are incomplete, some columns are a priori of little relevance, and part of the data consists of time series (which we will reduce to a single value).

In [1]:
# Import the raw dataset
data_orig <- read.csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")
In [2]:
# Preview the dataset; calling head() and summary() is a good habit :)
data <- data_orig
dim(data)
head(data)
  1. 344917
  2. 67
A data.frame: 6 × 67
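  | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ⋯ | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million
  | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl>
1 | AFG | Asia | Afghanistan | 2020-01-03 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA
2 | AFG | Asia | Afghanistan | 2020-01-04 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA
3 | AFG | Asia | Afghanistan | 2020-01-05 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA
4 | AFG | Asia | Afghanistan | 2020-01-06 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA
5 | AFG | Asia | Afghanistan | 2020-01-07 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA
6 | AFG | Asia | Afghanistan | 2020-01-08 | NA | 0 | 0 | NA | 0 | 0 | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA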
In [3]:
# For now we restrict ourselves to global statistics taken on a fixed day:
dmax <- '2021-12-31' #max(data$date)
data <- data[data$date == dmax,] # row filter, hence *before* the comma
dim(data)
summary(data)
  1. 254
  2. 67
   iso_code          continent           location             date          
 Length:254         Length:254         Length:254         Length:254        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  total_cases          new_cases       new_cases_smoothed   total_deaths    
 Min.   :        1   Min.   :      0   Min.   :      0.0   Min.   :      1  
 1st Qu.:    18326   1st Qu.:      0   1st Qu.:     37.0   1st Qu.:    281  
 Median :   148839   Median :    295   Median :    296.8   Median :   2468  
 Mean   :  5085552   Mean   :  22919   Mean   :  19680.6   Mean   : 101373  
 3rd Qu.:   806066   3rd Qu.:   1768   3rd Qu.:   1848.0   3rd Qu.:  16624  
 Max.   :285446097   Max.   :1349829   Max.   :1122946.0   Max.   :5474098  
 NA's   :19          NA's   :8         NA's   :8           NA's   :29       
   new_deaths      new_deaths_smoothed total_cases_per_million
 Min.   :   0.00   Min.   :   0.000    Min.   :     8.99      
 1st Qu.:   0.00   1st Qu.:   0.000    1st Qu.:  8437.41      
 Median :   0.00   Median :   1.214    Median : 57981.15      
 Mean   : 101.60   Mean   : 112.105    Mean   : 69265.17      
 3rd Qu.:   9.75   3rd Qu.:  13.607    3rd Qu.:108410.14      
 Max.   :5965.00   Max.   :6481.714    Max.   :289593.33      
 NA's   :8         NA's   :8           NA's   :19             
 new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million
 Min.   :   0.00       Min.   :   0.000               Min.   :   1.086        
 1st Qu.:   0.00       1st Qu.:   7.484               1st Qu.: 139.656        
 Median :  48.22       Median :  83.628               Median : 720.977        
 Mean   : 543.64       Mean   : 370.260               Mean   :1020.674        
 3rd Qu.: 358.59       3rd Qu.: 426.848               3rd Qu.:1658.001        
 Max.   :6861.15       Max.   :3543.851               Max.   :5949.676        
 NA's   :8             NA's   :8                      NA's   :29              
 new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate
 Min.   : 0.0000        Min.   : 0.0000                 Min.   :0.020    
 1st Qu.: 0.0000        1st Qu.: 0.0000                 1st Qu.:0.990    
 Median : 0.0000        Median : 0.1995                 Median :1.250    
 Mean   : 1.2384        Mean   : 1.3266                 Mean   :1.307    
 3rd Qu.: 0.9377        3rd Qu.: 1.4287                 3rd Qu.:1.570    
 Max.   :29.6820        Max.   :16.5200                 Max.   :4.220    
 NA's   :8              NA's   :8                       NA's   :69       
  icu_patients     icu_patients_per_million hosp_patients  
 Min.   :   19.0   Min.   : 0.579           Min.   :   67  
 1st Qu.:   75.5   1st Qu.:10.996           1st Qu.:  575  
 Median :  317.0   Median :17.341           Median : 1599  
 Mean   : 1129.5   Mean   :25.830           Mean   : 6510  
 3rd Qu.:  774.2   3rd Qu.:37.632           3rd Qu.: 3284  
 Max.   :18382.0   Max.   :84.912           Max.   :93776  
 NA's   :220       NA's   :220              NA's   :220    
 hosp_patients_per_million weekly_icu_admissions
 Min.   : 50.39            Min.   :   5.0       
 1st Qu.: 96.10            1st Qu.: 162.8       
 Median :157.06            Median : 523.0       
 Mean   :177.82            Mean   : 730.1       
 3rd Qu.:216.94            3rd Qu.:1130.8       
 Max.   :500.19            Max.   :1980.0       
 NA's   :220               NA's   :246          
 weekly_icu_admissions_per_million weekly_hosp_admissions
 Min.   : 0.529                    Min.   :  230         
 1st Qu.: 9.080                    1st Qu.:  894         
 Median :15.690                    Median : 2434         
 Mean   :15.395                    Mean   : 8879         
 3rd Qu.:22.003                    3rd Qu.: 8993         
 Max.   :29.198                    Max.   :92322         
 NA's   :246                       NA's   :235           
 weekly_hosp_admissions_per_million  total_tests          new_tests      
 Min.   : 30.10                     Min.   :    55939   Min.   :    162  
 1st Qu.: 74.18                     1st Qu.:  2056989   1st Qu.:   5700  
 Median :121.31                     Median :  7072918   Median :  18268  
 Mean   :133.89                     Mean   : 38588425   Mean   : 141706  
 3rd Qu.:203.27                     3rd Qu.: 28473905   3rd Qu.:  71131  
 Max.   :272.91                     Max.   :726359152   Max.   :2017702  
 NA's   :235                        NA's   :156         NA's   :161      
 total_tests_per_thousand new_tests_per_thousand new_tests_smoothed
 Min.   :   11.45         Min.   :  0.097        Min.   :      13  
 1st Qu.:  273.97         1st Qu.:  0.810        1st Qu.:    3442  
 Median :  954.34         Median :  2.616        Median :   11916  
 Mean   : 1876.38         Mean   :  7.742        Mean   :  201231  
 3rd Qu.: 2092.83         3rd Qu.:  5.987        3rd Qu.:   49380  
 Max.   :21654.04         Max.   :189.146        Max.   :14769984  
 NA's   :156              NA's   :161            NA's   :118       
 new_tests_smoothed_per_thousand positive_rate    tests_per_case   
 Min.   :  0.008                 Min.   :0.0000   Min.   :    1.5  
 1st Qu.:  0.477                 1st Qu.:0.0443   1st Qu.:    4.2  
 Median :  1.749                 Median :0.0970   Median :   10.3  
 Mean   :  5.240                 Mean   :0.1394   Mean   :  696.6  
 3rd Qu.:  4.349                 3rd Qu.:0.2386   3rd Qu.:   22.6  
 Max.   :110.367                 Max.   :0.6703   Max.   :65979.5  
 NA's   :118                     NA's   :127      NA's   :127      
 tests_units        total_vaccinations  people_vaccinated  
 Length:254         Min.   :1.050e+05   Min.   :5.568e+04  
 Class :character   1st Qu.:4.256e+06   1st Qu.:2.258e+06  
 Mode  :character   Median :1.549e+07   Median :8.206e+06  
                    Mean   :3.647e+08   Mean   :1.795e+08  
                    3rd Qu.:9.113e+07   3rd Qu.:4.509e+07  
                    Max.   :9.178e+09   Max.   :4.558e+09  
                    NA's   :155         NA's   :162        
 people_fully_vaccinated total_boosters      new_vaccinations  
 Min.   :4.931e+04       Min.   :     5245   Min.   :      45  
 1st Qu.:1.832e+06       1st Qu.:   834874   1st Qu.:    4035  
 Median :6.659e+06       Median :  2729817   Median :   28059  
 Mean   :1.506e+08       Mean   : 28731545   Mean   : 1611225  
 3rd Qu.:3.982e+07       3rd Qu.:  9342736   3rd Qu.:  186534  
 Max.   :3.879e+09       Max.   :536089356   Max.   :35457928  
 NA's   :161             NA's   :179         NA's   :169       
 new_vaccinations_smoothed total_vaccinations_per_hundred
 Min.   :       0          Min.   :  6.79                
 1st Qu.:     987          1st Qu.: 96.15                
 Median :   10899          Median :148.62                
 Mean   :  629396          Mean   :133.91                
 3rd Qu.:   69828          3rd Qu.:180.89                
 Max.   :35763420          Max.   :275.36                
 NA's   :23                NA's   :155                   
 people_vaccinated_per_hundred people_fully_vaccinated_per_hundred
 Min.   : 4.72                 Min.   : 2.05                      
 1st Qu.:50.73                 1st Qu.:43.51                      
 Median :69.69                 Median :63.24                      
 Mean   :62.04                 Mean   :55.63                      
 3rd Qu.:77.56                 3rd Qu.:71.47                      
 Max.   :93.19                 Max.   :89.24                      
 NA's   :162                   NA's   :161                        
 total_boosters_per_hundred new_vaccinations_smoothed_per_million
 Min.   : 0.020             Min.   :    0                        
 1st Qu.: 6.915             1st Qu.:  750                        
 Median :19.900             Median : 2253                        
 Mean   :22.071             Mean   : 2995                        
 3rd Qu.:33.240             3rd Qu.: 4310                        
 Max.   :57.400             Max.   :27586                        
 NA's   :179                NA's   :23                           
 new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred
 Min.   :      0                Min.   :0.00000                           
 1st Qu.:    199                1st Qu.:0.01500                           
 Median :   2002                Median :0.03300                           
 Mean   : 121963                Mean   :0.06284                           
 3rd Qu.:  16102                3rd Qu.:0.06900                           
 Max.   :6989791                Max.   :0.50500                           
 NA's   :23                     NA's   :23                                
 stringency_index population_density    median_age    aged_65_older   
 Min.   : 6.48    Min.   :    0.137   Min.   :15.10   Min.   : 1.144  
 1st Qu.:35.19    1st Qu.:   38.612   1st Qu.:22.27   1st Qu.: 3.526  
 Median :43.95    Median :   90.672   Median :29.80   Median : 6.378  
 Mean   :44.51    Mean   :  444.976   Mean   :30.53   Mean   : 8.703  
 3rd Qu.:53.80    3rd Qu.:  225.097   3rd Qu.:38.80   3rd Qu.:13.928  
 Max.   :85.08    Max.   :20546.766   Max.   :48.20   Max.   :27.049  
 NA's   :73       NA's   :39          NA's   :54      NA's   :61      
 aged_70_older    gdp_per_capita     extreme_poverty  cardiovasc_death_rate
 Min.   : 0.526   Min.   :   661.2   Min.   : 0.100   Min.   : 79.37       
 1st Qu.: 2.099   1st Qu.:  4126.5   1st Qu.: 0.625   1st Qu.:174.59       
 Median : 3.893   Median : 12595.3   Median : 2.500   Median :245.06       
 Mean   : 5.499   Mean   : 19173.6   Mean   :13.848   Mean   :264.27       
 3rd Qu.: 8.638   3rd Qu.: 27341.8   3rd Qu.:21.350   3rd Qu.:331.93       
 Max.   :18.493   Max.   :116935.6   Max.   :77.600   Max.   :724.42       
 NA's   :56       NA's   :58         NA's   :128      NA's   :58           
 diabetes_prevalence female_smokers   male_smokers   handwashing_facilities
 Min.   : 0.990      Min.   : 0.10   Min.   : 7.70   Min.   :  1.188       
 1st Qu.: 5.378      1st Qu.: 1.90   1st Qu.:22.60   1st Qu.: 20.482       
 Median : 7.205      Median : 6.30   Median :33.10   Median : 49.691       
 Mean   : 8.561      Mean   :10.79   Mean   :32.91   Mean   : 50.789       
 3rd Qu.:10.770      3rd Qu.:19.20   3rd Qu.:41.30   3rd Qu.: 82.687       
 Max.   :30.530      Max.   :44.00   Max.   :78.10   Max.   :100.000       
 NA's   :48          NA's   :107     NA's   :109     NA's   :158           
 hospital_beds_per_thousand life_expectancy human_development_index
 Min.   : 0.100             Min.   :53.28   Min.   :0.3940         
 1st Qu.: 1.300             1st Qu.:69.59   1st Qu.:0.6030         
 Median : 2.500             Median :75.05   Median :0.7400         
 Mean   : 3.097             Mean   :73.74   Mean   :0.7225         
 3rd Qu.: 4.200             3rd Qu.:79.46   3rd Qu.:0.8287         
 Max.   :13.800             Max.   :86.75   Max.   :0.9570         
 NA's   :81                 NA's   :21      NA's   :64             
   population        excess_mortality_cumulative_absolute
 Min.   :4.700e+01   Min.   : -13147.4                   
 1st Qu.:4.677e+05   1st Qu.:    205.4                   
 Median :5.763e+06   Median :   4737.8                   
 Mean   :1.275e+08   Mean   :  52703.1                   
 3rd Qu.:2.827e+07   3rd Qu.:  22096.6                   
 Max.   :7.975e+09   Max.   :1076833.1                   
                     NA's   :189                         
 excess_mortality_cumulative excess_mortality
 Min.   :-6.47               Min.   :-24.71  
 1st Qu.: 6.30               1st Qu.:  3.85  
 Median :19.24               Median : 13.91  
 Mean   :17.43               Mean   : 18.83  
 3rd Qu.:25.67               3rd Qu.: 32.22  
 Max.   :50.05               Max.   : 76.05  
 NA's   :189                 NA's   :189     
 excess_mortality_cumulative_per_million
 Min.   :-1034.9                        
 1st Qu.:  782.7                        
 Median : 1773.4                        
 Mean   : 2213.6                        
 3rd Qu.: 3305.7                        
 Max.   : 7421.2                        
 NA's   :189                            
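
To make the row filter above concrete, and to show a more compact way of spotting missing values than reading the whole summary(), here is a minimal sketch on a made-up data frame. The `toy` object and all its values are invented for the illustration; only the column names mimic the real dataset.

```r
# Toy data frame (invented values) with the same kind of structure as the real data
toy <- data.frame(
  location = c("A", "A", "B", "B"),
  date = c("2021-12-30", "2021-12-31", "2021-12-30", "2021-12-31"),
  total_cases = c(10, 12, NA, 7),
  icu_patients = c(NA, NA, NA, 3),
  stringsAsFactors = FALSE
)

# Row filter: the condition goes *before* the comma, all columns are kept
toy_day <- toy[toy$date == "2021-12-31", ]

# Number of missing values per column, in one line
na_counts <- colSums(is.na(toy_day))

print(toy_day)
print(na_counts)
```

On the real data, `colSums(is.na(data))` should give the same NA counts as the ones listed by summary(), just in a more readable form.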
In [4]:
# The dataset includes a few "summary rows" (continents, income groups...).
# Looking a bit closer, we notice that their ISO code starts with "OWID_":
owid_lines <- startsWith(data$iso_code, "OWID_")
data[owid_lines,]
A data.frame: 18 × 67
  | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ⋯ | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million
  | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl>
2093 | OWID_AFR |  | Africa | 2021-12-31 | 9850360 | 58952 | 44044.714 | 228878 | 295 | 211.571 | ⋯ | NA | NA | NA | NA | NA | 1426736614 | NA | NA | NA | NA
17104 | OWID_ASI |  | Asia | 2021-12-31 | 84647174 | 114105 | 85279.143 | 1256618 | 1055 | 1107.429 | ⋯ | NA | NA | NA | NA | NA | 4721383370 | NA | NA | NA | NA
87984 | OWID_ENG | Europe | England | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 56550000 | NA | NA | NA | NA
96168 | OWID_EUR |  | Europe | 2021-12-31 | 86422548 | 565941 | 569201.000 | 1566172 | 2308 | 3111.857 | ⋯ | NA | NA | NA | NA | NA | 744807803 | NA | NA | NA | NA
97538 | OWID_EUN |  | European Union | 2021-12-31 | 53608595 | 241084 | 351451.000 | 915208 | 1139 | 1654.286 | ⋯ | NA | NA | NA | NA | NA | 450146793 | NA | NA | NA | NA
131649 | OWID_HIC |  | High income | 2021-12-31 | 134132754 | 1084099 | 912211.429 | 2042081 | 3297 | 3470.714 | ⋯ | NA | NA | NA | NA | NA | 1250514600 | NA | NA | NA | NA
158897 | OWID_KOS | Europe | Kosovo | 2021-12-31 | 161399 | 53 | 20.571 | 2980 | 0 | 0.143 | ⋯ | NA | NA | NA | NA | NA | 1782115 | 6396.4 | 32.62 | 13.45 | 3848.565
173907 | OWID_LIC |  | Low income | 2021-12-31 | 1805520 | 23632 | 16829.571 | 42106 | 38 | 47.714 | ⋯ | NA | NA | NA | NA | NA | 737604900 | NA | NA | NA | NA
175271 | OWID_LMC |  | Lower middle income | 2021-12-31 | 65603392 | 70197 | 55666.714 | 1187053 | 831 | 1072.143 | ⋯ | NA | NA | NA | NA | NA | 3432097300 | NA | NA | NA | NA
221102 | OWID_NAM |  | North America | 2021-12-31 | 64102325 | 542800 | 357089.571 | 1224887 | 2012 | 1750.286 | ⋯ | NA | NA | NA | NA | NA | 600323657 | NA | NA | NA | NA
224823 | OWID_CYN | Asia | Northern Cyprus | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 382836 | NA | NA | NA | NA
225824 | OWID_NIR | Europe | Northern Ireland | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 1896000 | NA | NA | NA | NA
229914 | OWID_OCE |  | Oceania | 2021-12-31 | 548908 | 670 | 14575.286 | 4989 | 14 | 13.714 | ⋯ | NA | NA | NA | NA | NA | 45038860 | NA | NA | NA | NA
270779 | OWID_SCT | Europe | Scotland | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 5466000 | NA | NA | NA | NA
287145 | OWID_SAM |  | South America | 2021-12-31 | 39873056 | 67361 | 52751.429 | 1192550 | 281 | 286.857 | ⋯ | NA | NA | NA | NA | NA | 436816679 | NA | NA | NA | NA
328061 | OWID_UMC |  | Upper middle income | 2021-12-31 | 83617794 | 169473 | 136426.714 | 2200159 | 1793 | 1887.857 | ⋯ | NA | NA | NA | NA | NA | 2525921300 | NA | NA | NA | NA
337532 | OWID_WLS | Europe | Wales | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 3170000 | NA | NA | NA | NA
340184 | OWID_WRL |  | World | 2021-12-31 | 285446097 | 1349829 | 1122946.000 | 5474098 | 5965 | 6481.714 | ⋯ | 34.635 | 60.13 | 2.705 | 72.58 | 0.737 | 7975105024 | NA | NA | NA | NA
In [5]:
# All these special rows should a priori be removed, either because they are aggregates that can be recomputed later,
# or because they concern territories with very little data, with the apparent exception of Kosovo. Let us keep only that row.
kosovo_index <- data$iso_code == "OWID_KOS"
kosovo_line <- data[kosovo_index,]
kosovo_line[1] <- "KOS"
data <- rbind(data[!owid_lines,], kosovo_line)
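
For comparison, the same cleanup can be written as a single logical filter. This is only a sketch on a made-up data frame (`toy` and its figures are invented; only the "OWID_" prefix and the "OWID_KOS" code come from the dataset). Unlike the rbind() version above, it keeps the Kosovo row at its original position instead of moving it to the end.

```r
# Toy data frame (invented figures)
toy <- data.frame(
  iso_code = c("FRA", "OWID_EUR", "OWID_KOS", "DEU"),
  total_cases = c(100, 500, 20, 120),
  stringsAsFactors = FALSE
)

# Logical mask of the special rows
special <- startsWith(toy$iso_code, "OWID_")

# Keep the ordinary rows, plus the one special row we want
keep <- !special | toy$iso_code == "OWID_KOS"
cleaned <- toy[keep, ]

# Rename the code through the column name rather than its position
cleaned$iso_code[cleaned$iso_code == "OWID_KOS"] <- "KOS"

print(cleaned)
```

Addressing the column by name (`cleaned$iso_code`) rather than by index avoids surprises if the column order ever changes.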
In [6]:
# There are several ways to remove the rows with missing values.
# Here are 3 of them, from the most complex to the simplest:
data1 <- data[apply(data, 1, function(row) all(!is.na(row))),] # method 1
data2 <- data[complete.cases(data),]                           # method 2
data3 <- na.omit(data)                                         # method 3
In [7]:
# We then notice that no rows are left ==> we have to restrict the columns
# (or "guess" the missing values one way or another: see the missMDA package)
# (here we stick to the simple version: no missing data in the input).
nrow(data3)
0
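
To check that the three methods really select the same rows, here is a small sketch on an invented data frame (`toy` is made up for the example).

```r
# Toy data frame with missing values in both a numeric and a character column
toy <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d"),
  stringsAsFactors = FALSE
)

m1 <- toy[apply(toy, 1, function(row) all(!is.na(row))), ]  # method 1
m2 <- toy[complete.cases(toy), ]                            # method 2
m3 <- na.omit(toy)                                          # method 3

# The three results keep the same rows (here rows 1 and 4)
print(rownames(m1))
print(rownames(m2))
print(rownames(m3))
```

One difference worth knowing: na.omit() also records the dropped rows in an "na.action" attribute of its result, which the other two methods do not.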
In [8]:
# What do the variables represent?
colnames(data)
  1. 'iso_code'
  2. 'continent'
  3. 'location'
  4. 'date'
  5. 'total_cases'
  6. 'new_cases'
  7. 'new_cases_smoothed'
  8. 'total_deaths'
  9. 'new_deaths'
  10. 'new_deaths_smoothed'
  11. 'total_cases_per_million'
  12. 'new_cases_per_million'
  13. 'new_cases_smoothed_per_million'
  14. 'total_deaths_per_million'
  15. 'new_deaths_per_million'
  16. 'new_deaths_smoothed_per_million'
  17. 'reproduction_rate'
  18. 'icu_patients'
  19. 'icu_patients_per_million'
  20. 'hosp_patients'
  21. 'hosp_patients_per_million'
  22. 'weekly_icu_admissions'
  23. 'weekly_icu_admissions_per_million'
  24. 'weekly_hosp_admissions'
  25. 'weekly_hosp_admissions_per_million'
  26. 'total_tests'
  27. 'new_tests'
  28. 'total_tests_per_thousand'
  29. 'new_tests_per_thousand'
  30. 'new_tests_smoothed'
  31. 'new_tests_smoothed_per_thousand'
  32. 'positive_rate'
  33. 'tests_per_case'
  34. 'tests_units'
  35. 'total_vaccinations'
  36. 'people_vaccinated'
  37. 'people_fully_vaccinated'
  38. 'total_boosters'
  39. 'new_vaccinations'
  40. 'new_vaccinations_smoothed'
  41. 'total_vaccinations_per_hundred'
  42. 'people_vaccinated_per_hundred'
  43. 'people_fully_vaccinated_per_hundred'
  44. 'total_boosters_per_hundred'
  45. 'new_vaccinations_smoothed_per_million'
  46. 'new_people_vaccinated_smoothed'
  47. 'new_people_vaccinated_smoothed_per_hundred'
  48. 'stringency_index'
  49. 'population_density'
  50. 'median_age'
  51. 'aged_65_older'
  52. 'aged_70_older'
  53. 'gdp_per_capita'
  54. 'extreme_poverty'
  55. 'cardiovasc_death_rate'
  56. 'diabetes_prevalence'
  57. 'female_smokers'
  58. 'male_smokers'
  59. 'handwashing_facilities'
  60. 'hospital_beds_per_thousand'
  61. 'life_expectancy'
  62. 'human_development_index'
  63. 'population'
  64. 'excess_mortality_cumulative_absolute'
  65. 'excess_mortality_cumulative'
  66. 'excess_mortality'
  67. 'excess_mortality_cumulative_per_million'
In [9]:
# "new_*" variables: daily snapshot of a given indicator. We will not look at them here (see below).
# Same for the "weekly_*" variables (weekly indicators, I assume). What remains:
selection <- colnames(data)[!startsWith(colnames(data), "new_") & !startsWith(colnames(data), "weekly_")]
In [10]:
# Columns with more than 50% of their values filled in
selection <- selection[ apply(data[,selection], 2, function(col) sum(!is.na(col)) > nrow(data)/2) ]
selection
  1. 'iso_code'
  2. 'continent'
  3. 'location'
  4. 'date'
  5. 'total_cases'
  6. 'total_deaths'
  7. 'total_cases_per_million'
  8. 'total_deaths_per_million'
  9. 'reproduction_rate'
  10. 'positive_rate'
  11. 'tests_per_case'
  12. 'tests_units'
  13. 'stringency_index'
  14. 'population_density'
  15. 'median_age'
  16. 'aged_65_older'
  17. 'aged_70_older'
  18. 'gdp_per_capita'
  19. 'extreme_poverty'
  20. 'cardiovasc_death_rate'
  21. 'diabetes_prevalence'
  22. 'female_smokers'
  23. 'male_smokers'
  24. 'hospital_beds_per_thousand'
  25. 'life_expectancy'
  26. 'human_development_index'
  27. 'population'
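
The 50% threshold can also be expressed as a "fill rate" per column. A minimal sketch on invented data (the `toy` data frame is made up), using colMeans() instead of apply():

```r
# Toy data frame: columns with 100%, 25%, 50% and 75% of values filled in
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c(1, NA, 3, NA),
  d = c(NA, 2, 3, 4)
)

# Share of non-missing values in each column
coverage <- colMeans(!is.na(toy))

# Keep the columns filled in at strictly more than 50%
kept <- names(coverage)[coverage > 0.5]

print(coverage)
print(kept)
```

As in the cell above, the inequality is strict: a column that is exactly half filled in (here `c`) is dropped.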

Things are finally clearer:

  • iso_code: country identifier, 3 letters
  • location: country name
  • continent, date: well, continent and date =)
  • total_cases: total number of cases recorded up to dmax
  • total_deaths: total number of deaths recorded up to dmax
  • total_cases_per_million: total cases relative to the population (per million inhabitants)
  • total_deaths_per_million: total deaths relative to the population (per million inhabitants)
  • tests_units: "Units used by the location to report its testing data" https://github.com/owid/covid-19-data/blob/master/public/data/README.md
    in practice this column only holds text labels, often empty (see the next cell) ==> unusable here.
  • population: number of inhabitants
  • population_density: population density (per square kilometre)
  • median_age: median age, 50% of people are younger and 50% are older
  • aged_65_older: percentage of people aged 65 or older
  • aged_70_older: same with 70
  • gdp_per_capita: GDP per capita
  • extreme_poverty: percentage of the population below the extreme poverty line
  • cardiovasc_death_rate: "Death rate from cardiovascular disease in 2017 (annual number of deaths per 100,000 people)"
  • diabetes_prevalence: "Diabetes prevalence (% of population aged 20 to 79) in 2017"
  • female_smokers: percentage of women who smoke
  • male_smokers: percentage of men who smoke
  • hospital_beds_per_thousand: number of hospital beds per 1,000 inhabitants
  • life_expectancy: life expectancy
  • human_development_index: human development index
In [11]:
data$tests_units #???
  1. ''
  2. 'tests performed'
  3. ''
  4. ''
  5. 'tests performed'
  6. ''
  7. ''
  8. ''
  9. 'tests performed'
  10. 'tests performed'
  11. ''
  12. 'tests performed'
  13. 'tests performed'
  14. 'tests performed'
  15. 'tests performed'
  16. 'units unclear'
  17. 'tests performed'
  18. ''
  19. 'tests performed'
  20. 'tests performed'
  21. 'tests performed'
  22. ''
  23. ''
  24. 'samples tested'
  25. 'tests performed'
  26. ''
  27. 'tests performed'
  28. 'tests performed'
  29. 'tests performed'
  30. ''
  31. ''
  32. 'tests performed'
  33. ''
  34. ''
  35. 'tests performed'
  36. ''
  37. 'tests performed'
  38. 'tests performed'
  39. ''
  40. ''
  41. ''
  42. 'tests performed'
  43. 'tests performed'
  44. 'tests performed'
  45. ''
  46. ''
  47. ''
  48. 'people tested'
  49. 'tests performed'
  50. 'people tested'
  51. ''
  52. 'tests performed'
  53. 'tests performed'
  54. 'tests performed'
  55. ''
  56. 'tests performed'
  57. ''
  58. ''
  59. 'samples tested'
  60. 'people tested'
  61. ''
  62. 'tests performed'
  63. 'tests performed'
  64. ''
  65. 'tests performed'
  66. ''
  67. 'tests performed'
  68. 'people tested'
  69. ''
  70. 'tests performed'
  71. 'tests performed'
  72. 'people tested'
  73. ''
  74. ''
  75. 'tests performed'
  76. 'tests performed'
  77. 'tests performed'
  78. 'tests performed'
  79. 'tests performed'
  80. ''
  81. 'samples tested'
  82. ''
  83. ''
  84. ''
  85. 'tests performed'
  86. 'people tested'
  87. ''
  88. ''
  89. ''
  90. 'tests performed'
  91. ''
  92. ''
  93. 'tests performed'
  94. 'tests performed'
  95. 'tests performed'
  96. 'samples tested'
  97. 'people tested'
  98. 'tests performed'
  99. 'tests performed'
  100. 'tests performed'
  101. ''
  102. 'people tested'
  103. 'tests performed'
  104. 'samples tested'
  105. 'people tested'
  106. ''
  107. 'tests performed'
  108. ''
  109. 'tests performed'
  110. ''
  111. 'tests performed'
  112. ''
  113. 'tests performed'
  114. 'tests performed'
  115. ''
  116. ''
  117. ''
  118. 'samples tested'
  119. 'tests performed'
  120. 'tests performed'
  121. 'tests performed'
  122. ''
  123. 'tests performed'
  124. 'tests performed'
  125. 'people tested'
  126. 'samples tested'
  127. ''
  128. 'tests performed'
  129. ''
  130. ''
  131. 'tests performed'
  132. ''
  133. ''
  134. 'people tested'
  135. ''
  136. 'tests performed'
  137. ''
  138. 'samples tested'
  139. ''
  140. ''
  141. 'people tested'
  142. 'tests performed'
  143. 'samples tested'
  144. 'tests performed'
  145. ''
  146. 'samples tested'
  147. 'tests performed'
  148. ''
  149. 'tests performed'
  150. ''
  151. ''
  152. 'tests performed'
  153. ''
  154. 'samples tested'
  155. 'tests performed'
  156. ''
  157. 'people tested'
  158. ''
  159. 'tests performed'
  160. ''
  161. 'tests performed'
  162. 'tests performed'
  163. ''
  164. 'tests performed'
  165. 'tests performed'
  166. 'people tested'
  167. ''
  168. 'tests performed'
  169. 'tests performed'
  170. 'tests performed'
  171. 'tests performed'
  172. ''
  173. 'tests performed'
  174. 'tests performed'
  175. 'samples tested'
  176. ''
  177. ''
  178. 'people tested'
  179. ''
  180. ''
  181. ''
  182. 'tests performed'
  183. ''
  184. ''
  185. ''
  186. 'tests performed'
  187. 'tests performed'
  188. 'people tested'
  189. ''
  190. ''
  191. ''
  192. ''
  193. 'tests performed'
  194. 'tests performed'
  195. ''
  196. ''
  197. 'people tested'
  198. 'people tested'
  199. 'tests performed'
  200. 'tests performed'
  201. 'tests performed'
  202. ''
  203. 'tests performed'
  204. 'tests performed'
  205. 'tests performed'
  206. ''
  207. 'people tested'
  208. ''
  209. ''
  210. 'tests performed'
  211. 'tests performed'
  212. 'tests performed'
  213. ''
  214. ''
  215. 'people tested'
  216. 'people tested'
  217. 'tests performed'
  218. ''
  219. ''
  220. ''
  221. 'tests performed'
  222. 'tests performed'
  223. 'tests performed'
  224. 'tests performed'
  225. 'tests performed'
  226. 'tests performed'
  227. 'people tested'
  228. ''
  229. ''
  230. ''
  231. ''
  232. 'people tested'
  233. ''
  234. ''
  235. 'tests performed'
  236. 'tests performed'
  237. 'tests performed'
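
Rather than scrolling through 237 values, a frequency table summarizes such a column at a glance. A sketch on a short invented vector (`unit_labels` is made up, in the same spirit as data$tests_units):

```r
# Invented sample of labels, including empty strings
unit_labels <- c("", "tests performed", "", "people tested",
                 "tests performed", "samples tested", "")

# Count how many times each label occurs
counts <- table(unit_labels, useNA = "ifany")

# Most frequent labels first
print(sort(counts, decreasing = TRUE))
```

On the real data, `table(data$tests_units)` would show directly how many rows have an empty label and how many use each unit.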
In [12]:
selection <- selection[selection != "tests_units"]
newData <- na.omit(data[,selection])
nrow(newData)
72

72 rows is reasonable (about 30% of the 237 rows left after the cleanup above). However, to be consistent we also need to choose one type of variable: absolute or relative? I prefer relative indicators (*_per_million, *_density):

In [13]:
selection <- selection[!selection %in% c("total_cases", "total_deaths", "population")]
newData <- na.omit(data[,selection])
nrow(newData) # still 72
72
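
To see why relative indicators are easier to compare between countries, here is the arithmetic behind a "per million" variable, on two hypothetical countries (all figures are invented):

```r
# Invented figures for two hypothetical countries
toy <- data.frame(
  location = c("Small", "Large"),
  total_cases = c(5000, 200000),
  population = c(100000, 50000000),
  stringsAsFactors = FALSE
)

# Relative indicator: cases per million inhabitants
toy$total_cases_per_million <- toy$total_cases / toy$population * 1e6

print(toy)
```

"Large" has 40 times more cases in absolute terms, but once divided by the population it is "Small" that is hit much harder. Mixing both kinds of variables in the same PCA would blur this.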
In [14]:
rownames(newData) <- newData$iso_code # used to label the individuals on the plots
newData <- newData[,-c(1,4)] # drop the "ISO code" and "date" columns, no longer needed
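
Dropping columns by position works here, but it silently breaks if the column order changes. A possible alternative is to drop them by name, sketched below on an invented data frame (`toy` and its values are made up):

```r
# Toy data frame with the same first four columns as newData (invented values)
toy <- data.frame(
  iso_code = c("FRA", "DEU"),
  continent = c("Europe", "Europe"),
  location = c("France", "Germany"),
  date = c("2021-12-31", "2021-12-31"),
  median_age = c(40, 45),
  stringsAsFactors = FALSE
)

# Use the ISO code as row name, then drop columns by name instead of index
rownames(toy) <- toy$iso_code
toy <- toy[, !(names(toy) %in% c("iso_code", "date"))]

print(toy)
```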

Note: the whole analysis so far could have been done just as easily in another language, Python for example.
From here on, however, the R package FactoMineR is very convenient (no Python equivalent that I know of (?!))

In [15]:
# ...We are finally ready for the PCA!
library(FactoMineR)
res.pca <- PCA(newData, quali.sup=1:2, ncp=6, graph=FALSE)
In [16]:
options(repr.plot.width=15, repr.plot.height=10)
plotInd <- plot(res.pca, choix="ind", invisible="quali")
plotVar <- plot(res.pca, choix="var")
library(gridExtra)
grid.arrange(plotInd, plotVar, ncol=2)
Warning message:
“ggrepel: 5 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
[Figure: PCA individuals plot (left) and correlation circle of the variables (right)]
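
For those who want to see what lies behind these two plots without FactoMineR, here is a rough sketch using base R's prcomp() on the built-in USArrests dataset, which stands in for newData here (an assumption made only to keep the example self-contained). With scale. = TRUE each variable is standardized, which should be comparable to the default behaviour of PCA(), up to sign and scaling conventions.

```r
# USArrests ships with base R; it plays the role of newData in this sketch
X <- USArrests

# Centered and scaled PCA (each variable gets unit variance)
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Coordinates of the individuals on the first two components (individuals plot)
print(head(pca$x[, 1:2]))

# Correlations between the variables and the first two components:
# these are the coordinates drawn on the correlation circle
var_cor <- cor(X, pca$x[, 1:2])
print(round(var_cor, 2))
```

prcomp() has no notion of supplementary qualitative variables (the quali.sup argument used above), which is one of the reasons why FactoMineR is more convenient here.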

Well, no individual seems to stand out. In fact, some rows have almost all their values filled in, and once completed with a quick Google search they turn out to be extreme individuals (Monaco, Singapore). EDIT: I did not complete the file here; you can do it.

In [17]:
extremes <- which(data$location %in% c("Monaco", "Singapore"))
data[extremes,selection]
A data.frame: 2 × 23
  | iso_code | continent | location | date | total_cases_per_million | total_deaths_per_million | reproduction_rate | positive_rate | tests_per_case | stringency_index | ⋯ | aged_70_older | gdp_per_capita | extreme_poverty | cardiovasc_death_rate | diabetes_prevalence | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index
  | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl>
197910 | MCO | Europe | Monaco | 2021-12-31 | 138280.67 | 1041.353 | 0.93 | NA | NA | 34.13 | ⋯ | NA | NA | NA | NA | 5.46 | NA | NA | 13.8 | 86.75 | NA
277597 | SGP | Asia | Singapore | 2021-12-31 | 49566.07 | 146.886 | 1.14 | NA | NA | 42.77 | ⋯ | 7.049 | 85535.38 | NA | 92.243 | 10.99 | 5.2 | 28.3 | 2.4 | 83.62 | 0.938

On the variables side, aged_65_older and aged_70_older appear very strongly correlated (practically superimposed on the circle). That makes sense, so we will keep only aged_70_older after a numerical check:

In [18]:
cor(newData[,c("aged_65_older", "aged_70_older")])
A matrix: 2 × 2 of type dbl
              aged_65_older aged_70_older
aged_65_older     1.0000000     0.9939191
aged_70_older     0.9939191     1.0000000
In [19]:
newData <- subset(newData, select=-aged_65_older)
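
Here the redundant pair was spotted by eye on the correlation circle. With more variables, one can list the near-duplicate pairs automatically. A sketch on simulated data: the `toy` data frame, its column names and the 0.95 threshold are all arbitrary choices made for the example.

```r
set.seed(1)
n <- 200
base_age <- rnorm(n)

# Simulated data: aged_70 is almost a copy of aged_65, other is unrelated
toy <- data.frame(
  aged_65 = base_age,
  aged_70 = base_age + rnorm(n, sd = 0.05),
  other = rnorm(n)
)

cm <- cor(toy)

# Look at each pair once (upper triangle) and flag |r| above the threshold
high <- which(abs(cm) > 0.95 & upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r = cm[high],
  stringsAsFactors = FALSE
)

print(pairs)
```

Which variable of a flagged pair to drop remains a judgment call; the code only points at the candidates.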
In [20]:
res.pca <- PCA(newData, quali.sup=1:2, ncp=6, graph=FALSE)
In [21]:
plotInd <- plot(res.pca, choix="ind", invisible="quali", habillage=1)
plotVar <- plot(res.pca, choix="var")
library(gridExtra)
grid.arrange(plotInd, plotVar, ncol=2)
Warning message:
“ggrepel: 2 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
[Figure: PCA individuals plot colored by continent (left) and correlation circle of the variables (right)]

About 65% of the inertia is explained by this first PCA plane (51 + 14). The correlation circle logically opposes "wealth" on the right (HDI, GDP per capita) to "poverty" on the left. On the cloud of individuals this roughly corresponds to the opposition between Western Europe and Africa (with a few exceptions: Tunisia, Seychelles, ...).
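
The percentages of inertia read on the axes come from the eigenvalues of the correlation matrix (for a normalized PCA). A minimal sketch in base R, again on the built-in USArrests dataset used as a stand-in for newData (an assumption made to keep the example runnable on its own):

```r
# USArrests stands in for newData in this sketch
X <- USArrests

# Eigenvalues of the correlation matrix = inertia carried by each component
eig <- eigen(cor(X))$values

# Share of the total inertia per component, and cumulative share
pct <- 100 * eig / sum(eig)
print(round(pct, 1))
print(round(cumsum(pct), 1))
```

The second value of the cumulative sum is the inertia explained by the first plane. With standardized variables the eigenvalues add up to the number of variables, so each variable contributes one unit of inertia. In FactoMineR the same information is stored in res.pca$eig.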

Interestingly, the wealth indicators are clearly positively correlated with the number of deaths per million, which is itself negatively correlated with extreme_poverty: would COVID rather strike the rich? But why, since the virus is everywhere? Part of the answer lies in this same PCA plane: aged_70_older (people live longer there) and, to a lesser extent, diabetes_prevalence (more cases of diabetes), both to be checked numerically.

In [22]:
cor(newData[,c("aged_70_older", "total_deaths_per_million", "human_development_index",
               "diabetes_prevalence", "extreme_poverty")])
A matrix: 5 × 5 of type dbl
                         aged_70_older total_deaths_per_million human_development_index diabetes_prevalence extreme_poverty
aged_70_older               1.00000000                0.6116174               0.8203092         -0.02528401      -0.5541654
total_deaths_per_million    0.61161735                1.0000000               0.5002665          0.14617300      -0.4433506
human_development_index     0.82030920                0.5002665               1.0000000          0.17607414      -0.7396914
diabetes_prevalence        -0.02528401                0.1461730               0.1760741          1.00000000      -0.3462567
extreme_poverty            -0.55416536               -0.4433506              -0.7396914         -0.34625674       1.0000000
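
The idea that age could "explain" the link between wealth and deaths can be illustrated with a partial correlation, on purely simulated data. Everything below is an assumption made for the illustration: the variables `age`, `wealth` and `deaths` are simulated so that deaths depend on age only, and the coefficients are arbitrary. It shows the mechanism, not a result about the real dataset.

```r
set.seed(42)
n <- 500

# Simulated, standardized share of elderly people
age <- rnorm(n)
# Wealth goes along with an older population
wealth <- 0.8 * age + rnorm(n, sd = 0.6)
# Deaths depend on age only, not on wealth
deaths <- 0.7 * age + rnorm(n, sd = 0.7)

# Raw correlation: wealth and deaths look linked
raw <- cor(wealth, deaths)

# Partial correlation: correlate what is left once age is regressed out
partial <- cor(resid(lm(wealth ~ age)), resid(lm(deaths ~ age)))

print(round(c(raw = raw, partial = partial), 2))
```

In this simulation the raw correlation is clearly positive while the partial one is close to zero. The same computation could be tried on newData with aged_70_older, human_development_index and total_deaths_per_million to see how much of the link survives.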
In [23]:
newData
A data.frame: 72 × 20
continent location total_cases_per_million total_deaths_per_million reproduction_rate positive_rate tests_per_case stringency_index population_density median_age aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers hospital_beds_per_thousand life_expectancy human_development_index
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
ALBEurope Albania 73495.9991130.0641.480.1110 9.046.30 104.87138.0 8.64311803.431 1.1304.19510.08 7.151.22.89078.570.795
ARGSouth AmericaArgentina 127015.6202596.6862.110.3190 3.136.52 16.17731.9 7.44118933.907 0.6191.032 5.5016.227.75.00076.670.845
AUSOceania Australia 14004.709 93.3252.200.0745 13.447.16 3.20237.910.12944648.710 0.5107.791 5.0713.016.53.84083.440.944
AUTEurope Austria 141452.2571867.7531.180.0042 239.844.16 106.74944.413.74845436.686 0.7145.183 6.3528.430.97.37081.540.922
BGDAsia Bangladesh 9262.063 163.9851.430.0219 45.729.631265.03627.5 3.262 3523.98414.8298.003 8.38 1.044.70.80072.590.632
BELEurope Belgium 179883.8242429.4091.250.1750 5.733.89 375.56441.812.84942658.576 0.2114.898 4.2925.131.45.64081.630.931
BIHEurope Bosnia and Herzegovina 89830.9284152.7371.360.1867 5.435.19 68.49642.510.71111713.895 0.2329.63510.0830.247.73.50077.400.780
BGREurope Bulgaria 108210.9804501.3571.420.0838 11.945.82 65.18044.713.27218563.307 1.5424.688 5.8130.144.47.45475.050.816
CANNorth AmericaCanada 54674.470 779.0541.490.2292 4.468.16 4.03741.410.79744017.591 0.5105.599 7.3712.016.62.50082.430.929
CHLSouth AmericaChile 92058.0651994.3141.190.0258 38.831.42 24.28235.4 6.93822767.037 1.3127.993 8.4634.241.52.11080.180.851
CHNAsia China 92.624 3.9971.310.000065979.579.17 147.67438.7 5.92915308.712 0.7261.899 9.74 1.948.44.34076.910.761
COLSouth AmericaColombia 99059.2632503.4882.010.1350 7.449.40 44.22332.2 4.31213254.949 4.5124.240 7.44 4.713.51.71077.290.767
CRINorth AmericaCosta Rica 110206.1521419.4621.050.0910 11.045.31 96.07933.6 5.69415524.995 1.3137.973 8.78 6.417.41.13080.280.810
HRVEurope Croatia 176082.9863099.7221.310.3788 2.638.05 73.72644.013.05322669.797 0.7253.782 5.5934.339.95.54078.490.851
DNKEurope Denmark 133231.468 553.5291.340.1133 8.831.52 136.52042.312.32546682.515 0.2114.767 6.4119.318.82.50080.900.940
DOMNorth AmericaDominican Republic 37160.446 378.1341.620.0994 10.147.08 222.87327.6 4.41914600.861 1.6266.653 8.20 8.519.11.60074.080.756
ECUSouth AmericaEcuador 30320.5341870.3961.260.2760 3.653.30 66.93928.1 4.45810581.936 3.6140.448 5.55 2.012.31.50077.010.759
SLVNorth AmericaEl Salvador 19212.981 603.3400.650.0130 76.931.48 307.81127.6 5.417 7292.458 2.2167.295 8.87 2.518.81.30073.320.673
ESTEurope Estonia 171039.2561378.5161.270.1241 8.137.29 31.03342.713.49129481.252 0.5255.569 4.0224.539.34.69078.740.892
ETHAfrica Ethiopia 3367.185 56.1361.420.4357 2.340.74 104.95719.8 2.063 1729.92726.7182.634 7.47 0.4 8.50.30066.600.485
GMBAfrica Gambia 3758.322 126.7560.530.0534 18.713.89 207.56617.5 1.417 1561.76710.1331.430 1.91 0.731.21.10062.050.496
GEOAsia Georgia 249638.0583685.5181.020.0601 16.650.00 65.03238.710.244 9745.079 4.2496.218 7.11 5.355.52.60073.770.812
GHAAfrica Ghana 4364.905 39.0131.170.2351 4.351.27 126.71921.1 1.948 4227.63012.0298.245 4.97 0.3 7.70.90064.070.611
GRCEurope Greece 104473.8491974.4881.750.0581 17.274.28 83.47945.314.52424574.382 1.5175.695 4.5535.352.04.21082.240.888
HUNEurope Hungary 126053.6453931.4541.000.1975 5.127.96 108.04343.411.97626777.561 0.5278.296 7.5526.834.87.02076.880.854
INDAsia India 24583.308 339.4652.100.0101 98.768.64 450.41928.2 3.414 6426.67421.2282.28010.39 1.920.60.53069.660.645
IDNAsia Indonesia 15472.592 523.0251.160.0012 830.566.69 145.72529.3 3.05311188.744 5.7342.864 6.32 2.876.11.04071.720.718
IRNAsia Iran 69934.0291485.8400.860.0181 55.254.63 49.83132.4 3.18219082.620 0.2270.308 9.59 0.821.11.50076.680.783
IRLEurope Ireland 149793.9121211.2021.530.4300 2.342.84 69.87438.7 8.67867335.293 0.2126.459 3.2823.025.72.96082.300.955
ISRAsia Israel 146252.196 874.0612.100.0280 35.748.91 402.60630.6 7.35933132.320 0.5 93.320 6.7415.435.42.99082.970.919
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
MARAfrica Morocco 25656.966 396.2842.250.0777 12.965.27 80.08029.6 4.209 7485.013 1.0419.146 7.14 0.847.1 1.10076.680.686
MOZAfrica Mozambique 5587.555 60.5411.480.6374 1.637.04 37.72817.7 1.870 1136.10362.9329.942 3.30 5.129.1 0.70060.850.456
MMRAsia Myanmar 9797.725 355.6340.870.0116 86.380.56 81.72129.1 3.120 5591.597 6.4202.104 4.61 6.335.2 0.90067.130.583
NPLAsia Nepal 27119.361 379.5391.220.0250 40.033.33204.43025.0 3.212 2442.80415.0260.797 7.26 9.537.8 0.30070.780.602
NOREurope Norway 72669.388 256.5181.100.2700 3.751.85 14.46239.710.81364800.057 0.2114.316 5.3119.620.7 3.60082.400.957
PAKAsia Pakistan 5490.774 122.6381.440.0088113.547.52255.57323.5 2.780 5034.708 4.0423.031 8.35 2.836.7 0.60067.270.557
PANNorth AmericaPanama 111383.4331684.2151.710.1076 9.342.34 55.13329.7 5.03022267.037 2.2128.346 8.33 2.4 9.9 2.30078.510.815
PRYSouth AmericaParaguay 68738.9072451.6481.190.0445 22.540.74 17.14426.5 3.833 8827.010 1.7199.128 8.27 5.021.6 1.30074.250.728
PRTEurope Portugal 124888.6051838.9901.550.0731 13.748.15112.37146.214.92427936.896 0.5127.842 9.8516.330.0 3.39082.050.864
ROUEurope Romania 91927.2692986.5811.620.0275 36.456.48 85.12943.011.69023313.199 5.7370.946 9.7422.937.1 6.89276.050.828
RUSEurope Russia 72557.1262134.2890.820.0517 19.354.17 8.82339.6 9.39324765.954 0.1431.297 6.1823.458.3 8.05072.580.824
SVKEurope Slovakia 149152.0712947.6620.830.1070 9.356.34113.12841.2 9.16730155.152 0.7287.959 7.2923.137.7 5.82077.540.860
ZAFAfrica South Africa 57543.9721520.3720.780.2607 3.844.44 46.75427.3 3.05312294.87618.9200.380 5.52 8.133.2 2.32064.130.709
KORAsia South Korea 12259.772 108.5580.790.0236 42.346.39527.96743.4 8.62235938.374 0.2 85.998 6.80 6.240.912.27083.030.916
ESPEurope Spain 128265.6321919.2101.740.3080 3.243.44 93.10545.513.79934272.360 1.0 99.403 7.1727.431.4 2.97083.560.904
LKAAsia Sri Lanka 26898.175 686.0980.940.0702 14.253.80341.95534.1 5.33111669.077 0.7197.09310.68 0.327.0 3.60076.980.782
SWEEurope Sweden 125711.2641454.5921.570.1220 8.241.43 24.71841.013.43346949.283 0.5133.982 4.7918.818.9 2.22082.800.945
THAAsia Thailand 31011.538 302.6351.090.0744 13.443.06135.13240.1 6.89016277.671 0.1109.861 7.04 1.938.8 2.10077.150.777
TLSAsia Timor 14789.405 90.9570.700.0023429.140.58 87.17618.0 1.897 6570.10230.3335.346 6.86 6.378.1 5.90069.500.606
TGOAfrica Togo 3408.749 28.0271.640.2567 3.921.76143.36619.4 1.525 1429.81349.2280.033 6.15 0.914.2 0.70061.040.515
TUNAfrica Tunisia 58743.5402068.9351.650.0627 15.935.20 74.22832.7 5.07510849.297 2.0318.991 8.52 1.165.8 2.30076.700.740
TURAsia Turkey 110635.410 962.0671.390.0847 11.835.55104.91431.6 5.06125129.341 0.2171.28512.1314.141.1 2.81077.690.820
UGAAfrica Uganda 3019.518 69.7781.690.1726 5.873.15213.75916.4 1.308 1697.70741.6213.333 2.50 3.416.7 0.50063.370.544
UKREurope Ukraine 90829.3602330.7040.970.1988 5.039.49 77.39041.411.133 7894.393 0.1539.849 7.1113.547.4 8.80072.060.779
GBREurope United Kingdom199109.9962619.1051.240.1001 10.044.06272.89840.812.52739753.244 0.2122.137 4.2820.024.7 2.54081.320.932
USANorth AmericaUnited States 158249.7532421.1631.640.2600 3.847.65 35.60838.3 9.73254225.446 1.2151.08910.7919.124.6 2.77078.860.926
URYSouth AmericaUruguay 119875.9731802.0361.570.0773 12.928.70 19.75135.610.36120551.409 0.1160.708 6.9314.019.9 2.80077.910.817
VNMAsia Vietnam 17632.268 329.9221.030.0926 10.869.44308.12732.6 4.718 6171.884 2.0245.465 6.00 1.045.9 2.60075.400.704
ZMBAfrica Zambia 12448.652 186.3351.550.3133 3.237.96 22.99517.7 1.542 3689.25157.5234.499 3.94 3.124.7 2.00063.890.584
ZWEAfrica Zimbabwe 12973.101 306.1790.810.2858 3.547.22 42.72919.6 1.882 1899.77521.4307.846 1.82 1.630.7 1.70061.490.571

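The correlation of aged_70_older with HDI (0.82) and with total_deaths_per_million (0.61), and its anti-correlation with extreme_poverty (-0.55), are confirmed. Likewise, there is a slight positive correlation between diabetes_prevalence and HDI (0.18) and a negative one with extreme_poverty (-0.35). Note however that diabetes_prevalence is only weakly correlated with total_deaths_per_million (0.15), so age is the more convincing explanation of the two.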
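To probe whether age could account for the wealth/deaths link, one option is a first-order partial correlation between total_deaths_per_million and human_development_index, controlling for aged_70_older. The sketch below is not part of the original notebook: `partial_cor` is a hypothetical helper applying the standard formula, and the three inputs are simply copied from the `cor()` output above.

```r
# Pairwise Pearson correlations copied from the cor() output above
r_deaths_hdi <- 0.5002665  # total_deaths_per_million vs human_development_index
r_deaths_age <- 0.6116174  # total_deaths_per_million vs aged_70_older
r_hdi_age    <- 0.8203092  # human_development_index  vs aged_70_older

# Hypothetical helper: partial correlation of x and y given z,
# computed from the three pairwise correlations
partial_cor <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

p <- partial_cor(r_deaths_hdi, r_deaths_age, r_hdi_age)
print(round(p, 3))
```

With these values the partial correlation comes out close to zero, which is consistent with the idea that the apparent effect of wealth mostly goes through the age structure. It is only a linear, three-variable check and says nothing about causality.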

Next, the female smoking rate seems strongly correlated with median age: women would tend to smoke more in countries where people live longer (hence generally richer ones). This is not the case for male_smokers: the male smoking rate does not tell us much. Likewise, and more surprisingly, the death rate from cardiovascular disease (heart attacks, I imagine) does not seem correlated with anything, except, quite logically, the proportion of smokers: "According to the American Heart Association, cardiovascular disease accounts for about 800,000 U.S. deaths every year, making it the leading cause of all deaths in the United States. Of those, nearly 20 percent are due to cigarette smoking." [https://www.fda.gov/tobacco-products/health-effects-tobacco-use/how-smoking-affects-heart-health#]

Coloring by continent shows a top/bottom opposition between Eastern Europe and Western Europe + USA/Canada/Israel/South Korea/Australia. There seem to be relatively more smokers in Georgia/Ukraine/Russia. Central and South American countries lie lower on the plot, hence a priori less affected by deaths from heart attacks and with fewer smokers. There are not enough countries from Oceania to say much, and Asia is spread all over, showing great heterogeneity compared with the other continents.

Let's check our analysis by looking more closely at a few individuals:

In [24]:
indivs_indices <- rownames(newData) %in% c("LUX", "UKR", "NER", "ECU")
newData[indivs_indices,c("location", "total_deaths_per_million", "aged_70_older", "male_smokers",
                         "cardiovasc_death_rate", "human_development_index")]
A data.frame: 3 × 6
    location   total_deaths_per_million aged_70_older male_smokers cardiovasc_death_rate human_development_index
    <chr>      <dbl>                    <dbl>         <dbl>        <dbl>                 <dbl>
ECU Ecuador    1870.396                  4.458        12.3         140.448               0.759
LUX Luxembourg  980.542                  9.842        26.0         128.275               0.916
UKR Ukraine    2330.704                 11.133        47.4         539.849               0.779

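Luxembourg: fairly old population, high HDI, about half as many male smokers as Ukraine but about twice as many as Ecuador.
Niger: requested, but absent from the output (only 3 rows are returned), so the expected profile (young population, low HDI, few smokers, very few COVID deaths) cannot be checked on this table.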
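The table above returns 3 rows for 4 requested ISO codes, because `%in%` silently ignores codes that are not row names of `newData`. A minimal sketch of a check that makes this visible: `missing_codes` is a hypothetical helper, not part of the original notebook, and the `available` vector simply restates the row names shown in the output above.

```r
# Hypothetical helper: which requested codes have no matching row name?
missing_codes <- function(requested, available) {
  setdiff(requested, available)
}

requested <- c("LUX", "UKR", "NER", "ECU")
# Row names present in the output above; in the notebook one would
# pass rownames(newData) instead of this hand-written vector.
available <- c("ECU", "LUX", "UKR")

absent <- missing_codes(requested, available)
print(absent)
```

Running such a check before commenting on individual countries avoids discussing a row that was dropped earlier in the cleaning steps.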

Anyway, let's move on to the second PCA plane:

In [25]:
plotInd <- plot(res.pca, choix="ind", invisible="quali", habillage=1, axes=3:4)
plotVar <- plot(res.pca, choix="var", axes=3:4)
library(gridExtra)
grid.arrange(plotInd, plotVar, ncol=2)
Warning message:
“ggrepel: 12 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
[Figure: individuals map (left) and correlation circle (right), PCA axes 3-4]

Little inertia is explained in this plane (barely more than 17%), but there is one interesting observation: an anti-correlation between population_density and total_deaths_per_million? This must of course be checked numerically, since the latter arrow is far from the edge of the circle. It would also run against the naive expectation (densely populated => easier transmission => more cases => more very fragile people affected => more deaths), which predicts a positive correlation.
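Since population_density is very skewed (a handful of countries are far denser than the rest), a numerical check is safer if it reports a rank-based correlation next to Pearson's. The sketch below is an illustration under assumptions: `compare_cor` is a hypothetical helper, the `toy` data frame contains made-up values used only to show the call, and the last lines assume that `newData` from the cells above is in scope.

```r
# Hypothetical helper (not in the original notebook): Pearson and Spearman
# correlations between two numeric columns, on complete rows only.
compare_cor <- function(df, x, y) {
  ok <- complete.cases(df[[x]], df[[y]])
  c(pearson  = cor(df[[x]][ok], df[[y]][ok], method = "pearson"),
    spearman = cor(df[[x]][ok], df[[y]][ok], method = "spearman"),
    n        = sum(ok))
}

# Made-up values, only to demonstrate the call
toy <- data.frame(density = c(3, 25, 80, 150, 1400),
                  deaths  = c(900, 2000, 1500, 400, 950))
res <- compare_cor(toy, "density", "deaths")
print(res)

# On the notebook's data, if it is available in the session
if (exists("newData")) {
  print(compare_cor(newData, "population_density", "total_deaths_per_million"))
}
```

If the two coefficients disagree strongly, the Pearson value is probably driven by a few extreme points, and the direction read off a poorly represented arrow should not be trusted.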

We also note the anti-correlation between diabetes_prevalence and extreme_poverty, already somewhat visible in the first plane. Numerical check:

In [26]:
indivs_indices <- rownames(newData) %in% c("MLT", "EGY", "MNE", "MWI")
newData[indivs_indices,c("location", "total_deaths_per_million", "population_density",
                         "diabetes_prevalence", "extreme_poverty")]
A data.frame: 2 × 5
    location total_deaths_per_million population_density diabetes_prevalence extreme_poverty
    <chr>    <dbl>                    <dbl>              <dbl>               <dbl>
MWI Malawi   115.411                   197.519           3.94                71.4
MLT Malta    924.445                  1454.037           8.83                 0.2

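Egypt and Montenegro do not appear in the output (only 2 of the 4 requested rows are returned), so the intended Egypt / Malawi opposition on the diabetes/poverty axis and the Malta / Montenegro opposition on the deaths_per_million/density axis cannot be fully checked from this table. The two available rows do agree with the diabetes/poverty axis: Malawi combines a low diabetes prevalence (3.94) with very high extreme poverty (71.4), and Malta the reverse (8.83 and 0.2).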