Precise analysis of some Purely Random Forests
Random forests (Breiman, 2001) are a very effective and commonly used statistical method, but their full theoretical analysis is still an open problem. As a first step, simplified models such as purely random forests have been introduced, in order to shed light on the good performance of Breiman's random forests.
In the regression framework, the quadratic risk of a purely random forest can be written (approximately) as the sum of two terms, which can be understood as an approximation error and an estimation error. In this talk, we study how each of these terms depends on the size of each tree and on the number of trees in the forest.
Precise theoretical results are obtained for a toy model. On the one hand, if the regression function is smooth enough, the approximation error of an infinite forest decreases at a faster rate (with respect to the size of each tree) than the one of a single tree. On the other hand, the estimation error is of the same order of magnitude for a single tree and for an infinite forest. As a consequence, infinite forests attain a strictly better risk rate (with respect to the sample size) than single trees, because of their better approximation properties.
These results on the approximation error and on the risk can be generalized to other purely random forests. Numerical experiments on some purely random forests close to Breiman's algorithm (hold-out random forests) suggest that they behave similarly, which sheds light on how Breiman's random forests work.
This talk is based on joint works with Robin Genuer.
References:
http://arxiv.org/abs/1407.3939v2
http://arxiv.org/abs/1604.01515