The code and the raw and cleaned data for the project can be found at: https://github.com/longm89/Dengue_prediction
The purpose of the project is to predict the weekly number of dengue cases in two cities, San Juan and Iquitos, using environmental and climate variables. The challenge was organized in 2015 by several departments of the U.S. Federal Government, with the support of the Pandemic Prediction and Forecasting Science and Technology Interagency Working Group under the National Science and Technology Council (https://dengueforecasting.noaa.gov/).
The data for each city consists of:
Additionally, we downloaded environmental, social and economic data from the World Bank and chose several parameters that might explain the number of cases:
Another factor that could affect the spread of the disease is migration; however, we could not find such data.
We will model the total number of cases using the following models:
As our data are counts, we will use the Poisson and Negative Binomial distribution families when we apply the GAM and Gradient Tree Boosting models. The metric used to judge the quality of the models in the competition is the MAE score, but we will also look at R squared, deviance explained, ACF and PACF graphs, and plots of the residuals. In general, our out-of-the-box models achieve good results without much tuning. There are several places where the results could be improved: for example, some models have residuals with mean 0 but non-constant variance, so we could have applied an ARIMA model to the residuals of the GAM model; we could also fine-tune the parameters of the models further and do more feature engineering.
We split the csv file containing the environmental data of the two cities into one file per city and fill in the missing values using spline interpolation.
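As an illustration, a minimal sketch of this interpolation step (assuming the zoo package is available; the column shown is just one example) could be:
# fill the missing values of a climate column with cubic-spline interpolation
library(zoo)
merged_iq_train$reanalysis_specific_humidity_g_per_kg <-
  na.spline(merged_iq_train$reanalysis_specific_humidity_g_per_kg)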
We download the data from the World Bank, clean it, and select important variables that might contribute to predicting the number of dengue cases:
* Total population
* population_density_people_per_sq_km_of_land_area
* forest_area_sq_km
* gdp_current_us
* employment_to_population_average
* Population age percentage: 0 - 9, 10 - 20, 20 - 60, 60+
These two sources of data form the following groups:
* Time of the year
* Climate variables
* Total population
* Population density
* Population age
* Economic conditions
The code for cleaning the data can be found in Dengue_prediction/R/wrangle-data.R. We import the cleaned data:
# import the cleaned data
load("rdas/merged_iq_train.rda")
load("rdas/merged_sj_train.rda")
In later sections, we will apply classical machine learning models such as the Generalized Additive Model (GAM), Regression Trees, Random Forests, and ARIMA. Since the number of cases is a time series (the count for a given week is highly influenced by the counts of the previous weeks), we add lag variables to reduce the auto-correlation of the residuals, so that the residual errors are closer to independent. Our models will then be able to use the number of cases from past weeks to predict the number of cases of the current week.
# Add the lag variables for the previous 1-5 weeks.
# The first k rows of the lag-k column have no past data, so we set them to 0.
add_lags <- function(df) {
  for (k in 1:5) {
    col <- paste0("lag_", k, "_total_cases")
    df[[col]] <- dplyr::lag(df$total_cases, n = k)
    df[1:k, col] <- 0
  }
  df
}
# for Iquitos
merged_iq_train <- add_lags(merged_iq_train)
# for San Juan
merged_sj_train <- add_lags(merged_sj_train)
We add a Time index variable to model the trend in the data over time.
# We add the Time term
merged_iq_train$Time <- seq.int(nrow(merged_iq_train))
merged_sj_train$Time <- seq.int(nrow(merged_sj_train))
For Iquitos:
######## For Iquitos
# define the start and end of the plotting window as Date objects
startTime_iq <- as.Date("2001-01-01")
endTime_iq <- as.Date("2009-12-24")
limits_iq <- c(startTime_iq, endTime_iq)
iq_weekly_cases <- ggplot(merged_iq_train, aes(week_start_date, total_cases)) +
geom_line(na.rm=TRUE) +
geom_ma(ma_fun = SMA, n = 52) + # moving average with period of 52 weeks to detect trend
ggtitle("Total number of cases from 2001 - 2010 in Iquitos") +
xlab("Date") + ylab("Total number of cases")
# format x-axis: dates
iq_weekly_cases_time_series <- iq_weekly_cases +
(scale_x_date(limits = limits_iq,
breaks = date_breaks("1 year"),
labels = date_format("%b %y")))
iq_weekly_cases_time_series
# look for seasonality over the week of the year
merged_iq_train %>% filter(2001 <= year & year <= 2009) %>%
ggplot(aes(x = weekofyear, y = total_cases, color = factor(year))) +
geom_line()
The dotted blue line is the 52-week moving average, which indicates the trend in the series. We also plot the number of cases against the week of the year, and we see that there are more cases in weeks 1-10 and weeks 40-52.
# For San Juan
# define the start and end of the plotting window as Date objects
startTime_sj <- as.Date("1990-04-30")
endTime_sj <- as.Date("2008-04-22")
limits_sj <- c(startTime_sj, endTime_sj)
sj_weekly_cases <- ggplot(merged_sj_train, aes(week_start_date, total_cases)) +
geom_line(na.rm=TRUE) +
geom_ma(ma_fun = SMA, n = 52) + # moving average with period of 52 weeks to detect trend
ggtitle("Total number of cases from 1990 - 2008 in San Juan") +
xlab("Date") + ylab("Total number of cases")
# format x-axis: dates
sj_weekly_cases_time_series <- sj_weekly_cases +
(scale_x_date(limits = limits_sj,
breaks = date_breaks("1 year"),
labels = date_format("%b %y")))
sj_weekly_cases_time_series
# look for seasonality, depending on the week of the year
merged_sj_train %>% filter(1991 <= year & year <= 1999) %>%
ggplot(aes(x = weekofyear, y = total_cases, color = factor(year))) +
geom_line()
We make similar plots for San Juan. There is also a trend in the data, and there are more cases in weeks 30-52.
In the following, we draw the two histograms of the number of cases and look at their means and variances:
# draw the two histograms
iq_histogram <- ggplot(data=merged_iq_train, aes(total_cases)) +
geom_histogram(aes(y =..density..), fill = "orange") +
geom_density() +
ggtitle("Histogram of total cases in Iquitos ")
sj_histogram <- ggplot(data=merged_sj_train, aes(total_cases)) +
geom_histogram(aes(y =..density..), fill = "orange") +
geom_density() +
ggtitle("Histogram of total cases in San Juan ")
grid.arrange(iq_histogram, sj_histogram, ncol=2)
mean(merged_iq_train$total_cases) # mean of Iquitos
## [1] 7.565385
var(merged_iq_train$total_cases) # variance of Iquitos
## [1] 115.8955
mean(merged_sj_train$total_cases) # mean of San Juan
## [1] 34.18056
var(merged_sj_train$total_cases) # variance of San Juan
## [1] 2640.045
The numbers of cases for both cities are natural numbers and do not follow Gaussian distributions. We will therefore assume that they follow a Poisson or Negative Binomial distribution, and in particular a Negative Binomial distribution, as the mean is much smaller than the variance. In fact, the Poisson distribution can be interpreted as the limiting case of the Negative Binomial distribution as the parameter \(r \rightarrow \infty\), and the Negative Binomial distribution is commonly used to model over-dispersed count data.
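As a quick numerical illustration of this limit (the values of mu, r and k below are arbitrary, chosen only for the demonstration):
# for a large size parameter r, the Negative Binomial density is
# practically identical to the Poisson density with the same mean
mu <- 10
k <- 0:40
max(abs(dnbinom(k, size = 1e6, mu = mu) - dpois(k, mu)))
# for a small r, the Negative Binomial variance mu + mu^2 / r is much
# larger than the mean, which captures the over-dispersion
r <- 3
mu + mu^2 / r # 43.33, compared to a mean of 10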
In the following, we split the data into a training set and a testing set. As our data is a time series, instead of selecting the two sets at random, we split the series chronologically, in order to train our models on past data and test their predictions on the future.
################## Split train and test sets
# For Iquitos
iq_train_size <- round(nrow(merged_iq_train) * 0.8)
iq_train <- head(merged_iq_train, iq_train_size)
iq_test <- tail(merged_iq_train, nrow(merged_iq_train) - iq_train_size)
# For San Juan
sj_train_size <- round(nrow(merged_sj_train) * 0.8)
sj_train <- head(merged_sj_train, sj_train_size)
sj_test <- tail(merged_sj_train, nrow(merged_sj_train) - sj_train_size)
The MAE score will be used, as in the competition, to evaluate different models.
mae <- function(y, ychap) {
  return(round(mean(abs(y - ychap)), digits = 2))
}
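A quick sanity check of the helper on made-up values:
mae(c(0, 2, 4), c(1, 2, 2))
## [1] 1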
We first fit a GAM model with the Negative Binomial distribution family, using climate features, the population feature, and, to model trend and seasonality, the Time and weekofyear variables. After that, we look at a second model that adds auto-regressive features in the form of lag variables.
For Iquitos:
g1 <- gam(total_cases ~ s(reanalysis_specific_humidity_g_per_kg) + s(reanalysis_dew_point_temp_k) + s(ndvi_sw)
          + s(population_total)
          + s(Time) + s(weekofyear),
          family = nb(), data = iq_train, method = "REML")
#gam.check(g1)
summary(g1)
##
## Family: Negative Binomial(2.737)
## Link function: log
##
## Formula:
## total_cases ~ s(reanalysis_specific_humidity_g_per_kg) + s(reanalysis_dew_point_temp_k) +
## s(ndvi_sw) + s(population_total) + s(Time) + s(weekofyear)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.10561 0.07046 15.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(reanalysis_specific_humidity_g_per_kg) 5.502 6.717 20.405 0.00479 **
## s(reanalysis_dew_point_temp_k) 1.003 1.004 2.268 0.13303
## s(ndvi_sw) 2.134 2.708 2.221 0.41128
## s(population_total) 5.991 6.323 1.685 0.94152
## s(Time) 4.441 4.654 1.014 0.96198
## s(weekofyear) 4.750 5.868 104.960 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.414 Deviance explained = 68.7%
## -REML = 1042.7 Scale est. = 1 n = 416
#plot.gam(g1)
The deviance explained is 68.7%. The variables weekofyear and reanalysis_specific_humidity_g_per_kg are significant (with very small p-values) in predicting the number of cases. We look at the ACF, PACF and the graph of the residuals:
# on train set
train_ychap <- predict(g1, newdata = iq_train, type = "response")
# look at ACF and PACF of residuals to see if there's auto-correlation
par(mfrow=c(1,2))
acf(iq_train$total_cases - train_ychap)
pacf(iq_train$total_cases - train_ychap)
iq_weekly_cases <- ggplot(iq_train, aes(week_start_date, total_cases - train_ychap)) +
geom_line(na.rm=TRUE) +
geom_ma(ma_fun = SMA, n = 52) + # moving average with period of 52 weeks to detect trend
ggtitle("Residuals from 2001 - 2010 in Iquitos") +
xlab("Date") + ylab("Residual of number of cases")
iq_weekly_cases
The residuals are still strongly auto-correlated, and their variance is not constant over time. We look at how the model fits the testing set and compute the MAE score:
# on test set
test_ychap <- predict(g1, newdata = iq_test, type = "response")
plot(iq_test$total_cases, type='l')
lines(test_ychap,col='red')
mae(iq_test$total_cases, test_ychap)
## [1] 7.66
One idea to improve the model is to fit a SARIMA model to the residuals. A minimal sketch of that idea (assuming the forecast package is available; it reuses train_ychap and test_ychap from the g1 fit above) could look like:
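# sketch (not run): model the remaining auto-correlation of the GAM residuals
# with a seasonal ARIMA and add its forecast back to the GAM predictions
library(forecast)
resid_train <- ts(iq_train$total_cases - train_ychap, frequency = 52)
resid_fit <- auto.arima(resid_train, seasonal = TRUE)
resid_fc <- forecast(resid_fit, h = nrow(iq_test))
mae(iq_test$total_cases, test_ychap + as.numeric(resid_fc$mean))
In the following, to keep things simple, we will instead just add lag variables to improve the model: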
g2 <- gam(total_cases ~ s(reanalysis_specific_humidity_g_per_kg) + s(reanalysis_dew_point_temp_k) + s(ndvi_sw)
          + s(population_total)
          + s(Time) + s(weekofyear)
          + s(log(lag_1_total_cases + 1)) + s(log(lag_2_total_cases + 1)) + s(log(lag_3_total_cases + 1)),
          family = nb(), data = iq_train, method = "REML")
#gam.check(g2)
summary(g2)
##
## Family: Negative Binomial(4.892)
## Link function: log
##
## Formula:
## total_cases ~ s(reanalysis_specific_humidity_g_per_kg) + s(reanalysis_dew_point_temp_k) +
## s(ndvi_sw) + s(population_total) + s(Time) + s(weekofyear) +
## s(log(lag_1_total_cases + 1)) + s(log(lag_2_total_cases +
## 1)) + s(log(lag_3_total_cases + 1))
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1096 0.0601 18.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(reanalysis_specific_humidity_g_per_kg) 5.665 6.873 17.899 0.01297 *
## s(reanalysis_dew_point_temp_k) 1.001 1.001 0.650 0.42051
## s(ndvi_sw) 2.139 2.710 3.606 0.22351
## s(population_total) 7.181 7.927 43.288 < 2e-16 ***
## s(Time) 1.003 1.003 9.701 0.00187 **
## s(weekofyear) 3.621 4.532 24.965 8.71e-05 ***
## s(log(lag_1_total_cases + 1)) 1.001 1.003 48.568 < 2e-16 ***
## s(log(lag_2_total_cases + 1)) 4.337 5.338 30.314 2.01e-05 ***
## s(log(lag_3_total_cases + 1)) 2.681 3.424 7.424 0.09342 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.675 Deviance explained = 78.4%
## -REML = 973.49 Scale est. = 1 n = 416
#plot.gam(g2)
# on train set
train_ychap <- predict(g2, newdata = iq_train, type = "response")
# look at ACF and PACF of residuals to see if there's auto-correlation
par(mfrow=c(1,2))
par("mar"=c(5, 4, 4, 1))
acf(iq_train$total_cases - train_ychap)
pacf(iq_train$total_cases - train_ychap)
iq_weekly_cases <- ggplot(iq_train, aes(week_start_date, total_cases - train_ychap)) +
geom_line(na.rm=TRUE) +
geom_ma(ma_fun = SMA, n = 52) + # moving average with period of 52 weeks to detect trend
ggtitle("Residuals from 2001 - 2010 in Iquitos") +
xlab("Date") + ylab("Residual of number of cases")
iq_weekly_cases
# on test set
par(mfrow=c(1,1))
test_ychap <- predict(g2, newdata = iq_test, type = "response")
mae(iq_test$total_cases, test_ychap)
## [1] 5.54
plot(iq_test$total_cases, type='l')
lines(test_ychap,col='red')