Project presentation

More than half of the world's population is now reported to live in urban areas. Combined with the growing amount of available urban data, this makes data-driven analyses a promising way to improve the lives of citizens. We therefore decided to focus on taxi trips, which act as valuable sensors of economic activity and human behavior in cities.

We found a dataset in a Kaggle competition supported by Google. The goal of this competition is to predict the fare amount (inclusive of tolls) for a taxi ride in New York City. The original dataset contains taxi trips from 2009 to mid-2015. Each row corresponds to a trip and provides, as covariates, the pickup and drop-off geolocations, the number of passengers, and the date and time of the trip. We considered this dataset valuable because it appeared to be rather exhaustive, covering more than 54M trips. Moreover, there were no missing values, although some entries seemed erroneous, as we will see later.

On the competition homepage, it was stated that basic predictors achieve RMSEs of about \(\$5\) to \(\$8\). Our objective was to improve on such scores. However, the full dataset was too heavy for us to analyze, so we modified the dataset under study to better match our computational power. As you will see, we first aggregated the data and performed some predictions on the aggregates. After that, we sampled \(50000\) trips from the original dataset. These were then split between a training set (trips before June 2014) and a test set (trips after June 2014).
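To make the evaluation protocol concrete, here is a minimal sketch of a random sample followed by a time-based split, scored with the RMSE. It is an illustration under assumptions, not the code from our repository: the field names `pickup_datetime` and `fare_amount`, the helper names, the cutoff of 1 June 2014 (the text only says "June 2014") and the toy trips are all chosen for the example, and the mean-fare baseline is only a stand-in for a "basic predictor".

```python
import math
import random
from datetime import datetime


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted fares."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def sample_and_split(trips, n_samples, cutoff, seed=0):
    """Draw n_samples trips at random, then split them by pickup time."""
    rng = random.Random(seed)
    sampled = rng.sample(trips, min(n_samples, len(trips)))
    train = [t for t in sampled if t["pickup_datetime"] < cutoff]
    test = [t for t in sampled if t["pickup_datetime"] >= cutoff]
    return train, test


# Toy trips standing in for the real data (hypothetical values).
trips = [
    {"pickup_datetime": datetime(2013, 5, 1, 8, 0), "fare_amount": 10.0},
    {"pickup_datetime": datetime(2014, 3, 12, 18, 30), "fare_amount": 6.5},
    {"pickup_datetime": datetime(2014, 9, 2, 23, 15), "fare_amount": 14.0},
    {"pickup_datetime": datetime(2015, 1, 20, 7, 45), "fare_amount": 8.5},
]

# Assumed cutoff: the exact day in June 2014 is not specified in the text.
CUTOFF = datetime(2014, 6, 1)
train, test = sample_and_split(trips, n_samples=4, cutoff=CUTOFF)

# Naive baseline: always predict the mean fare observed in the training set.
mean_fare = sum(t["fare_amount"] for t in train) / len(train)
baseline_rmse = rmse([t["fare_amount"] for t in test], [mean_fare] * len(test))
print(f"baseline RMSE: {baseline_rmse:.2f}")
```

Splitting on the pickup time rather than at random keeps every test trip strictly later than the training trips, so the score reflects how a model would behave on future rides.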

The code we have produced is available at https://github.com/TRandrianarisoa/NYCTaxiFare.

Exploratory data analysis and feature engineering

Geographical insights

First, we looked at the locations of the taxi trips. Looking at extreme values, we saw that some trips were located outside of New York, sometimes even on the other side of the world. We discarded these rows and, in addition, only kept trips with fewer than 10 passengers, as larger counts do not seem plausible.
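The cleaning rules above can be sketched as a simple row filter. This is only an illustration: the bounding box around New York City is a rough assumption made for this example (the exact limits used in the project are not given here), and the field names and toy trips are hypothetical.

```python
# Rough bounding box around New York City (assumed values, in degrees).
LON_MIN, LON_MAX = -74.3, -73.6
LAT_MIN, LAT_MAX = 40.4, 41.0
MAX_PASSENGERS = 9  # keep trips with fewer than 10 passengers


def in_nyc_box(lon, lat):
    """Return True if the point lies inside the assumed bounding box."""
    return LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX


def is_plausible(trip):
    """Keep a trip only if both endpoints and the passenger count look sane."""
    return (
        in_nyc_box(trip["pickup_longitude"], trip["pickup_latitude"])
        and in_nyc_box(trip["dropoff_longitude"], trip["dropoff_latitude"])
        and trip["passenger_count"] <= MAX_PASSENGERS
    )


# Toy rows (hypothetical): one valid trip, one at (0, 0), one with 12 passengers.
trips = [
    {"pickup_longitude": -73.9857, "pickup_latitude": 40.7580,
     "dropoff_longitude": -73.7781, "dropoff_latitude": 40.6413,
     "passenger_count": 2},
    {"pickup_longitude": 0.0, "pickup_latitude": 0.0,
     "dropoff_longitude": -73.9857, "dropoff_latitude": 40.7580,
     "passenger_count": 1},
    {"pickup_longitude": -73.9857, "pickup_latitude": 40.7580,
     "dropoff_longitude": -73.9700, "dropoff_latitude": 40.7700,
     "passenger_count": 12},
]

clean_trips = [t for t in trips if is_plausible(t)]
print(f"kept {len(clean_trips)} of {len(trips)} trips")
```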

Below, we have plotted the locations that are most frequently associated with a trip. The bigger and redder the dot, the more trips have their pickup (or drop-off) location there.
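One possible way to obtain such per-location counts is to snap each coordinate to a coarse grid and count the trips per cell; the counts can then drive the dot size and color. The sketch below assumes a grid of 0.01 degrees and uses hypothetical points, and it is not necessarily how the figure was actually produced.

```python
from collections import Counter

GRID_DECIMALS = 2  # assumed grid resolution: 0.01 degrees


def grid_cell(lon, lat, decimals=GRID_DECIMALS):
    """Snap a coordinate pair to the grid cell it falls in."""
    return (round(lon, decimals), round(lat, decimals))


# Hypothetical pickup points given as (longitude, latitude).
pickups = [
    (-73.9857, 40.7580),
    (-73.9861, 40.7612),
    (-73.7781, 40.6413),
]

# Number of pickups per grid cell; larger counts would map to bigger, redder dots.
counts = Counter(grid_cell(lon, lat) for lon, lat in pickups)
busiest_cell, busiest_count = counts.most_common(1)[0]
print(busiest_cell, busiest_count)
```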