Efficient statistical tests for high-dimensional data
|Speaker:||NGUYEN Binh Tuan|
|Advisors:||ARLOT Sylvain / THIRION Bertrand|
|Location:||Amphithéâtre Yoccoz|
Some Contributions to Modern Multiple Hypothesis Testing in High-dimension
This thesis deals with multiple testing problems in high dimensions, a regime that has become increasingly common in statistical inference. Its main goal is to provide efficient and reliable algorithms for multivariate inference, a hard problem that suffers from the curse of dimensionality. Our solutions improve on state-of-the-art methods, making them more stable and efficient while maintaining their theoretical guarantees on the control of multiple testing metrics. Moreover, we show that our contributions perform well compared to the state of the art in practical applications, on analysis problems from the life sciences such as neuroscience, medical imaging and genomics.

In particular, we study the properties of the knockoff filter, a method for controlling the False Discovery Rate (FDR) that requires only limited distributional assumptions. We then propose methods for aggregating several knockoff samplings to address the knockoff filter's randomness, and prove non-asymptotic theoretical results on the aggregated knockoffs, specifically guaranteed FDR control, relying on concentration inequalities. Furthermore, we extend the method with a version that scales to extremely high-dimensional regimes. A key step is to use randomized clustering to reduce the dimension, thereby avoiding the curse of dimensionality, and then to ensemble several runs to tame the bias induced by the selection of a fixed clustering. To account for the compression of the data resulting from the clustering step, we introduce a spatial relaxation of the False Discovery Rate.

Finally, we consider the problem of producing p-values for conditional inference in high-dimensional logistic regression. Our method is a variant of the Conditional Randomization Test, with an additional decorrelation scheme that yields more accurate test statistics and greater power than previous estimators.
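To illustrate the aggregation idea mentioned above, the following minimal sketch (an illustration under assumptions, not the thesis implementation; the function names and the quantile level `gamma` are choices made here) shows one common way to combine per-run p-values from several randomized runs: a quantile aggregation across runs, followed by Benjamini-Hochberg selection to control the FDR at a target level.

```python
import numpy as np

def quantile_aggregate(pvals, gamma=0.5):
    """Aggregate a (B, p) array of p-values from B randomized runs.

    Takes the gamma-quantile across runs for each variable and inflates
    it by 1/gamma, a standard quantile-aggregation correction; the
    result is clipped at 1 so it remains a valid p-value.
    """
    q = np.quantile(pvals, gamma, axis=0)
    return np.minimum(1.0, q / gamma)

def bh_select(pvals, fdr=0.1):
    """Benjamini-Hochberg step-up procedure: return selected indices."""
    p = len(pvals)
    order = np.argsort(pvals)
    thresholds = fdr * np.arange(1, p + 1) / p
    below = pvals[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])  # largest rank passing its threshold
    return np.sort(order[: k + 1])

# Toy example: 3 runs, 2 variables; variable 0 is consistently small.
runs = np.array([[0.001, 0.5],
                 [0.002, 0.6],
                 [0.003, 0.7]])
agg = quantile_aggregate(runs, gamma=0.5)   # -> [0.004, 1.0]
selected = bh_select(agg, fdr=0.1)          # -> [0]
```

Aggregating before selection tames the run-to-run variability of a single randomized draw, which is the motivation behind the aggregated knockoffs described in the abstract.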
We conclude the thesis with a discussion of open questions that we believe are important and can serve as directions for further improving high-dimensional inference methods and their application to fields such as genomics or brain imaging.