The code and the raw and cleaned data for the project can be found at: https://github.com/longm89/Dengue_prediction

Introduction

The purpose of the project is to predict the weekly number of dengue cases in two cities, San Juan and Iquitos, using environmental and climate variables. The challenge was organized in 2015 by several departments of the U.S. Federal Government, with the support of the Pandemic Prediction and Forecasting Science and Technology Interagency Working Group under the National Science and Technology Council (https://dengueforecasting.noaa.gov/).

Data

The data for each city consists of:

  • Time indicators
  • NOAA’s GHCN daily climate data weather station measurements.
  • PERSIANN satellite precipitation measurements.
  • NOAA’s NCEP Climate Forecast System Reanalysis measurements.
  • Satellite vegetation.
  • The number of cases for each week.

Additionally, we downloaded environmental, social and economic data from the World Bank and chose several parameters that might explain the number of cases:

  • forest_area_sq_km
  • Total population
  • population_density_people_per_sq_km_of_land_area
  • gdp_current_us
  • employment_to_population_average
  • Population age percentage: 0 - 9, 10 - 20, 20 - 60, 60+

Migration numbers could also affect the spread of the disease; however, we could not find suitable data.

Methods

We will model the total number of cases using the following models:

  • GAM with the Negative Binomial distribution family
  • Regression Tree
  • Random Forest
  • Time Series (ARIMA)
  • Gradient Tree Boosting
  • Expert Aggregation (EG)

As our data are counts, we will use the Poisson and Negative Binomial distribution families when fitting the GAM and Gradient Tree Boosting models. The metric used to judge the quality of the models in the competition is the MAE score, but we will also look at R squared, deviance explained, ACF and PACF graphs, and plots of the residuals. In general, our out-of-the-box models achieve good results without much tuning. There are several places where the results could be improved: for example, some models have residuals with mean 0 but non-constant variance, so we could have applied an ARIMA model to the residuals of the GAM model, or we could further tune the parameters of the models and do more feature engineering.
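For reference, the MAE score is the mean absolute error between the observed weekly counts \(y_i\) and the predicted counts \(\hat{y}_i\):

\[
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|.
\]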

Data Wrangling and Exploration

Data Wrangling

Data from the challenge

The data for each city consists of:

  • Time indicators:
    • week_start_date
    • weekofyear
    • year
  • NOAA’s GHCN daily climate data weather station measurements.
    • station_max_temp_c - maximum temperature
    • station_min_temp_c - minimum temperature
    • station_avg_temp_c - average temperature
    • station_precip_mm - total precipitation
    • station_diur_temp_rng_c - diurnal temperature range
  • PERSIANN satellite precipitation measurements.
    • precipitation_amt_mm - total precipitation
  • NOAA’s NCEP Climate Forecast System Reanalysis measurements.
    • reanalysis_sat_precip_amt_mm – Total precipitation
    • reanalysis_dew_point_temp_k – Mean dew point temperature
    • reanalysis_air_temp_k – Mean air temperature
    • reanalysis_relative_humidity_percent – Mean relative humidity
    • reanalysis_specific_humidity_g_per_kg – Mean specific humidity
    • reanalysis_precip_amt_kg_per_m2 – Total precipitation
    • reanalysis_max_air_temp_k – Maximum air temperature
    • reanalysis_min_air_temp_k – Minimum air temperature
    • reanalysis_avg_temp_k – Average air temperature
    • reanalysis_tdtr_k – Diurnal temperature range
  • Satellite vegetation: Normalized difference vegetation index
    • ndvi_se – Pixel southeast of city centroid
    • ndvi_sw – Pixel southwest of city centroid
    • ndvi_ne – Pixel northeast of city centroid
    • ndvi_nw – Pixel northwest of city centroid

We split the csv file containing the environmental data of the two cities into one file per city and fill in the missing values using spline interpolation.
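As an illustration, missing values in a climate column can be filled by spline interpolation along the time index. The following is a minimal sketch using zoo::na.spline on a placeholder data frame and a few example column names; the actual cleaning steps are in Dengue_prediction/R/wrangle-data.R and may differ:

library(dplyr)
library(zoo)

# Minimal sketch (hypothetical names): iq_raw holds the Iquitos rows, already
# ordered by week_start_date; spline-interpolate the missing climate values.
climate_cols <- c("station_avg_temp_c", "station_precip_mm", "ndvi_sw")
iq_raw <- iq_raw %>%
  mutate(across(all_of(climate_cols), ~ zoo::na.spline(.x)))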

Data from the World Bank

We downloaded the data from the World Bank, cleaned it, and selected variables that might contribute to predicting the number of dengue cases (a sketch of how such indicators can be retrieved is shown after the list):
* Total population
* population_density_people_per_sq_km_of_land_area
* forest_area_sq_km
* gdp_current_us
* employment_to_population_average
* Population age percentage: 0 - 9, 10 - 20, 20 - 60, 60+
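A minimal sketch of retrieving such indicators with the WDI package; the country codes, indicator codes and year range below are illustrative assumptions, and the actual download and cleaning steps are in Dengue_prediction/R/wrangle-data.R:

library(WDI)

# Hypothetical example: pull a few World Bank indicators for Peru (Iquitos)
# and Puerto Rico (San Juan).
wb_raw <- WDI(country = c("PE", "PR"),
              indicator = c(population_total   = "SP.POP.TOTL",
                            gdp_current_us     = "NY.GDP.MKTP.CD",
                            forest_area_sq_km  = "AG.LND.FRST.K2"),
              start = 1990, end = 2010)
head(wb_raw)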

These two sources of data form the following groups:
* Time of the year
* Climate variables
* Total population
* Population density
* Population age
* Economic conditions

The code for cleaning the data can be found in Dengue_prediction/R/wrangle-data.R. We import the cleaned data:

# import the cleaned data
load("rdas/merged_iq_train.rda")
load("rdas/merged_sj_train.rda")

Feature engineering

In later sections, we will apply classical machine learning models such as the Generalized Additive Model, Regression Tree, Random Forest and ARIMA models. The number of cases is a time series: the number of cases in a given week is highly influenced by the number of cases in the previous weeks. In order to have independent residual errors, we add lag variables to reduce their auto-correlation, so that our models can use the number of cases from past weeks to predict the number of cases of the current week.

# Add the lagged total_cases variables (1-5 weeks); leading values with no available lag are set to 0

# for Iquitos
merged_iq_train <- merged_iq_train %>%
  mutate(lag_1_total_cases = lag(total_cases, 1, default = 0),
         lag_2_total_cases = lag(total_cases, 2, default = 0),
         lag_3_total_cases = lag(total_cases, 3, default = 0),
         lag_4_total_cases = lag(total_cases, 4, default = 0),
         lag_5_total_cases = lag(total_cases, 5, default = 0))

# for San Juan
merged_sj_train <- merged_sj_train %>%
  mutate(lag_1_total_cases = lag(total_cases, 1, default = 0),
         lag_2_total_cases = lag(total_cases, 2, default = 0),
         lag_3_total_cases = lag(total_cases, 3, default = 0),
         lag_4_total_cases = lag(total_cases, 4, default = 0),
         lag_5_total_cases = lag(total_cases, 5, default = 0))

We add a Time trend variable (the row index) to model the trend in the data over time.

# We add the Time term 
merged_iq_train$Time <- seq.int(nrow(merged_iq_train))
merged_sj_train$Time <- seq.int(nrow(merged_sj_train))

Data Exploration

Visualising the number of cases over time:

For Iquitos:

######## For Iquitos
# Define Start and end times for the subset as R objects that are the time class
startTime_iq <- as.Date("2001-01-01")
endTime_iq <- as.Date("2009-12-24")
# create a start and end time R object
limits_iq <- c(startTime_iq, endTime_iq)
iq_weekly_cases <- ggplot(merged_iq_train, aes(week_start_date, total_cases)) +
  geom_line(na.rm=TRUE) + 
  geom_ma(ma_fun = SMA, n = 52) + # moving average with period of 52 weeks to detect trend
  ggtitle("Total number of cases from 2001 - 2010 in Iquitos") +
  xlab("Date") + ylab("Total number of cases")
# format x-axis: dates
iq_weekly_cases_time_series <- iq_weekly_cases + 
  (scale_x_date(limits = limits_iq,
                breaks = date_breaks("1 year"),
                labels = date_format("%b %y")))
iq_weekly_cases_time_series

# looking for seasonality over the week of the year
merged_iq_train %>% filter(2001 <= year & year <= 2009) %>% 
  ggplot(aes(x = weekofyear, y = total_cases, color = factor(year))) +
  geom_line()

The blue dotted line is the 52-week moving average, indicating the trend in the series. We also plot the number of cases against the week of the year, and we see that there are more cases in weeks 1-10 and weeks 40-52.

# For San Juan
# create a start and end time R object
startTime_sj <- as.Date("1990-04-30")
endTime_sj <- as.Date("2000-04-22")

limits_sj <- c(startTime_sj, endTime_sj)
sj_weekly_cases <- ggplot(merged_sj_train, aes(week_start_date, total_cases)) +
  geom_line(na.rm=TRUE) + 
  geom_ma(ma_fun = SMA, n = 52) + # moving average with period of 52 weeks to detect trend
  ggtitle("Total number of cases from 1990 - 2008 in San Juan") +
  xlab("Date") + ylab("Total number of cases")
# format x-axis: dates
sj_weekly_cases_time_series <- sj_weekly_cases + 
  (scale_x_date(limits = limits_sj,
                breaks = date_breaks("1 year"),
                labels = date_format("%b %y")))

sj_weekly_cases_time_series

# look for seasonality, depending on the week of the year
merged_sj_train %>% filter(1991 <= year & year <= 1999) %>% 
  ggplot(aes(x = weekofyear, y = total_cases, color = factor(year))) +
  geom_line()

We make similar plots for San Juan. There is also a trend in the data, and there are more cases in weeks 30-52.
In the following, we draw the histograms of the number of cases for the two cities and look at their means and variances:

# draw the two histograms
iq_histogram <- ggplot(data=merged_iq_train, aes(total_cases)) +
  geom_histogram(aes(y =..density..), fill = "orange") +
  geom_density() +
  ggtitle("Histogram of total cases in Iquitos ")

sj_histogram <- ggplot(data=merged_sj_train, aes(total_cases)) +
  geom_histogram(aes(y =..density..), fill = "orange") +
  geom_density() + 
  ggtitle("Histogram of total cases in San Juan ") 

grid.arrange(iq_histogram, sj_histogram, ncol=2)

mean(merged_iq_train$total_cases) # mean of Iquitos
## [1] 7.565385
var(merged_iq_train$total_cases) # variance of Iquitos
## [1] 115.8955
mean(merged_sj_train$total_cases) # mean of San Juan
## [1] 34.18056
var(merged_sj_train$total_cases) # variance of San Juan
## [1] 2640.045

The numbers of cases for both cities are counts (natural numbers) and do not follow Gaussian distributions. We therefore assume that they follow a Poisson or a Negative Binomial distribution, and in particular a Negative Binomial distribution, since the variance is much larger than the mean (over-dispersion). The Poisson distribution can be interpreted as a limiting case of the Negative Binomial distribution as the parameter \(r \rightarrow \infty\), and the Negative Binomial distribution is commonly used to model over-dispersed Poisson counts.
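Concretely, with the usual parameterisation of the Negative Binomial distribution, the variance grows quadratically with the mean, which is what accommodates the over-dispersion observed above:

\[
\mathbb{E}[Y] = \mu, \qquad \operatorname{Var}(Y) = \mu + \frac{\mu^2}{r},
\]

and \(\operatorname{Var}(Y) \rightarrow \mu\), the Poisson case, as \(r \rightarrow \infty\).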

Modeling and Prediction

Preparing the data for training and testing

In the following, we split the data into a training set and a testing set. As our data are time series, instead of selecting each set at random, we split the series chronologically, so that our models are trained on past data and their predictions are tested on future data.

##################Split train and test sets
# For Iquitos
iq_train_size <- round(nrow(merged_iq_train) * 0.8)
iq_train <- head(merged_iq_train, iq_train_size)
iq_test <- tail(merged_iq_train, nrow(merged_iq_train) - iq_train_size)
# For San Juan
sj_train_size <- round(nrow(merged_sj_train) * 0.8)
sj_train <- head(merged_sj_train, sj_train_size)
sj_test <- tail(merged_sj_train, nrow(merged_sj_train) - sj_train_size)

The MAE score will be used, as in the competition, to evaluate different models.

mae <- function(y, ychap)
{
  return(round(mean(abs(y - ychap)), digits = 2))
}

Generalized Additive Models with the Negative Binomial distribution family

GAM models

We first fit a GAM model with the Negative Binomial distribution family, using humidity, dew point temperature and vegetation features, the population feature, and, in addition, Time and weekofyear to model trend and seasonality. After that, we look at another model, this time adding auto-regressive features in the form of lag variables.

For Iquitos:

g1 <- gam(total_cases ~ s(reanalysis_specific_humidity_g_per_kg) + s(reanalysis_dew_point_temp_k) + s(ndvi_sw)
  + s(population_total) 
  + s(Time) + s(weekofyear) 
  ,family = nb(), data = iq_train, method = "REML")
#gam.check(g1)
summary(g1)
## 
## Family: Negative Binomial(2.737) 
## Link function: log 
## 
## Formula:
## total_cases ~ s(reanalysis_specific_humidity_g_per_kg) + s(reanalysis_dew_point_temp_k) + 
##     s(ndvi_sw) + s(population_total) + s(Time) + s(weekofyear)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.10561    0.07046   15.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                            edf Ref.df  Chi.sq p-value    
## s(reanalysis_specific_humidity_g_per_kg) 5.502  6.717  20.405 0.00479 ** 
## s(reanalysis_dew_point_temp_k)           1.003  1.004   2.268 0.13303    
## s(ndvi_sw)                               2.134  2.708   2.221 0.41128    
## s(population_total)                      5.991  6.323   1.685 0.94152    
## s(Time)                                  4.441  4.654   1.014 0.96198    
## s(weekofyear)                            4.750  5.868 104.960 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.414   Deviance explained = 68.7%
## -REML = 1042.7  Scale est. = 1         n = 416
#plot.gam(g1)

The deviance explained is 68.7%. The variables weekofyear and reanalysis_specific_humidity_g_per_kg are significant (with very small p-values) in predicting the number of cases. We look at the ACF, PACF and the graph of the residuals:

# on train set
train_ychap <- predict(g1, newdata = iq_train, type = "response")
# look at ACF and PACF of residuals to see if there's auto-correlation
par(mfrow=c(1,2))
acf(iq_train$total_cases - train_ychap)
pacf(iq_train$total_cases - train_ychap)

startTime_iq <- as.Date("2001-01-01")
endTime_iq <- as.Date("2009-12-24")
# create a start and end time R object
limits_iq <- c(startTime_iq, endTime_iq)
iq_weekly_cases <- ggplot(iq_train, aes(week_start_date, total_cases - train_ychap)) +
  geom_line(na.rm=TRUE) + 
  geom_ma(ma_fun = SMA, n = 52) + # moving average with period of 52 weeks to detect trend
  ggtitle("Residuals from 2001 - 2010 in Iquitos") +
  xlab("Date") + ylab("Residual of number of cases")
iq_weekly_cases

There is still a lot of auto-correlation in the residuals, and their variance is not constant over time. We look at how the model fits the testing set and at the MAE score:

# on test set
test_ychap <- predict(g1, newdata = iq_test, type = "response")

plot(iq_test$total_cases, type='l')
lines(test_ychap,col='red')

mae(iq_test$total_cases, test_ychap)
## [1] 7.66

One idea to improve the model is to fit a SARIMA model to the residuals. A minimal sketch of that alternative, assuming the forecast package, could look like this (we do not evaluate its output here):
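library(forecast)

# Hypothetical sketch: capture the remaining auto-correlation in the GAM residuals
# with an ARIMA model and add its forecasts to the GAM predictions on the test set.
gam_resid <- iq_train$total_cases - predict(g1, newdata = iq_train, type = "response")
resid_fit <- auto.arima(ts(gam_resid, frequency = 52))
resid_fc  <- forecast(resid_fit, h = nrow(iq_test))
test_ychap_arima <- predict(g1, newdata = iq_test, type = "response") + as.numeric(resid_fc$mean)
mae(iq_test$total_cases, test_ychap_arima)

In the following, to keep things simple, we will instead just add lag variables to improve the model: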

g2 <- gam(total_cases ~ s(reanalysis_specific_humidity_g_per_kg) + s(reanalysis_dew_point_temp_k) + s(ndvi_sw)
  + s(population_total) 
  + s(Time) + s(weekofyear) 
  + s(log(lag_1_total_cases+1)) + s(log(lag_2_total_cases+1))+ s(log(lag_3_total_cases+1))
  ,family = nb(), data = iq_train, method = "REML")
#gam.check(g2)
summary(g2)
## 
## Family: Negative Binomial(4.892) 
## Link function: log 
## 
## Formula:
## total_cases ~ s(reanalysis_specific_humidity_g_per_kg) + s(reanalysis_dew_point_temp_k) + 
##     s(ndvi_sw) + s(population_total) + s(Time) + s(weekofyear) + 
##     s(log(lag_1_total_cases + 1)) + s(log(lag_2_total_cases + 
##     1)) + s(log(lag_3_total_cases + 1))
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.1096     0.0601   18.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                            edf Ref.df Chi.sq  p-value    
## s(reanalysis_specific_humidity_g_per_kg) 5.665  6.873 17.899  0.01297 *  
## s(reanalysis_dew_point_temp_k)           1.001  1.001  0.650  0.42051    
## s(ndvi_sw)                               2.139  2.710  3.606  0.22351    
## s(population_total)                      7.181  7.927 43.288  < 2e-16 ***
## s(Time)                                  1.003  1.003  9.701  0.00187 ** 
## s(weekofyear)                            3.621  4.532 24.965 8.71e-05 ***
## s(log(lag_1_total_cases + 1))            1.001  1.003 48.568  < 2e-16 ***
## s(log(lag_2_total_cases + 1))            4.337  5.338 30.314 2.01e-05 ***
## s(log(lag_3_total_cases + 1))            2.681  3.424  7.424  0.09342 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.675   Deviance explained = 78.4%
## -REML = 973.49  Scale est. = 1         n = 416
#plot.gam(g2)


# on train set
train_ychap <- predict(g2, newdata = iq_train, type = "response")
# look at ACF and PACF of residuals to see if there's auto-correlation
par(mfrow=c(1,2))
par("mar"=c(5, 4, 4, 1))
acf(iq_train$total_cases - train_ychap)
pacf(iq_train$total_cases - train_ychap)

startTime_iq <- as.Date("2001-01-01")
endTime_iq <- as.Date("2009-12-24")
# create a start and end time R object
limits_iq <- c(startTime_iq, endTime_iq)
iq_weekly_cases <- ggplot(iq_train, aes(week_start_date, total_cases - train_ychap)) +
  geom_line(na.rm=TRUE) + 
  geom_ma(ma_fun = SMA, n = 52) + # moving average with period of 52 weeks to detect trend
  ggtitle("Residuals from 2001 - 2010 in Iquitos") +
  xlab("Date") + ylab("Residual of number of cases")
iq_weekly_cases

# on test set
par(mfrow=c(1,1))
test_ychap <- predict(g2, newdata = iq_test, type = "response")
mae(iq_test$total_cases, test_ychap)
## [1] 5.54
plot(iq_test$total_cases, type='l')
lines(test_ychap,col='red')