Deep learning theory has made significant progress in recent years. Below is a short personal selection of papers and talks giving some theoretical understanding of deep learning (with a focus on feature learning). Many thanks to Lenaïc Chizat for many of the references.

The NTK limit is nicely explained in these [Short video, Long video], and in this [paper].

A more detailed presentation can be found in this [paper].
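The defining property of the NTK (lazy) regime is that, as the width grows, the parameters barely move during training even though the network fits the data, so the network stays well approximated by its linearization at initialization. A minimal numpy sketch of this (toy 1-D regression, 1-hidden-layer ReLU net with the usual 1/√m output scaling; all hyperparameters are illustrative choices, not taken from the references above):

```python
import numpy as np

def relative_weight_movement(width, steps=300, lr=0.2, seed=0):
    """Train f(x) = a . relu(W x) / sqrt(width) by full-batch GD and
    return how far the hidden weights moved, relative to their init."""
    rng = np.random.default_rng(seed)
    X = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)   # toy 1-D inputs
    y = np.sin(3.0 * X).ravel()                     # toy regression target
    W = rng.standard_normal((width, 1))
    a = rng.standard_normal(width)
    W0 = W.copy()
    n = len(y)
    for _ in range(steps):
        H = np.maximum(W @ X.T, 0.0)                # hidden activations, (width, n)
        err = a @ H / np.sqrt(width) - y            # residuals, (n,)
        # gradients of (1/2) * mean((pred - y)**2)
        grad_a = H @ err / (np.sqrt(width) * n)
        grad_W = ((a[:, None] * (H > 0.0)) * err) @ X / (np.sqrt(width) * n)
        a -= lr * grad_a
        W -= lr * grad_W
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

for m in [10, 100, 1000, 10000]:
    print(m, relative_weight_movement(m))   # movement shrinks as width grows
```

As the width increases, the relative movement decays (roughly like 1/√m): training happens in a shrinking neighbourhood of the initialization, which is exactly where the tangent-kernel approximation is valid.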

1-hidden-layer NNs in the hydrodynamic (mean-field) regime have been intensively investigated recently, and some interesting feature-learning phenomena have been exhibited.

Feature learning in wide 1-hidden-layer NNs [video]

In the paper Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss, Chizat and Bach analyse the gradient flow corresponding to the limit of GD with infinitesimal gradient steps. In the considered setting, they essentially show that, after a transient kernel regime, the NN converges to a max-margin classifier in a certain functional space. This can be interpreted as a max-margin classifier on some learned features. From a statistical perspective, it follows that the NN classifier adapts (at least) to the intrinsic dimension of the problem.
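The max-margin implicit bias is easiest to see in the simpler linear case (studied by Soudry et al. for logistic regression on separable data): the GD iterates diverge in norm, but their direction converges to the max-margin separator. A small numpy illustration, with data chosen so that, by symmetry, the max-margin direction through the origin is e1 (the data and step sizes are illustrative, not from the paper):

```python
import numpy as np

# Separable toy data; by symmetry the max-margin separator
# through the origin has direction e1 = (1, 0).
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def gd_direction(steps=20000, lr=0.1):
    w = np.array([0.0, 3.0])        # start far from the max-margin direction
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of sum_i log(1 + exp(-margins_i))
        grad = -(y / (1.0 + np.exp(margins))) @ X
        w -= lr * grad
    return w / np.linalg.norm(w)    # only the direction converges

print(gd_direction())               # approaches (1, 0), the max-margin direction
```

The norm of w keeps growing (the logistic loss has no finite minimizer on separable data), which is why only the normalized direction is meaningful.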

Wide 1-hidden-layer NNs learn the informative directions [video]

In the same direction, in the paper When Do Neural Networks Outperform Kernel Methods?, Ghorbani et al. show that, in contrast to kernel methods, wide 1-hidden-layer NNs are able to learn the informative directions in the data, and thereby avoid the curse of dimensionality.
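A hedged numpy sketch of this phenomenon in a single-index setting (not the setting of the paper): the target depends on a single coordinate of a high-dimensional input, and during training the first-layer weight mass concentrates on that coordinate, something a fixed kernel cannot do. The width, step size, and alignment measure below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 64, 1024
X = rng.standard_normal((n, d))
y = np.maximum(X[:, 0], 0.0)        # target depends on the first coordinate only

# mean-field-scaled 1-hidden-layer net: f(x) = (1/m) * sum_i a_i relu(w_i . x)
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m)

def alignment(W):
    """Fraction of first-layer weight mass on the informative coordinate."""
    return float((W[:, 0] ** 2).sum() / (W ** 2).sum())

align_init = alignment(W)           # about 1/d at random initialization
lr = 0.5 * m                        # step size compensates the 1/m output scaling
for _ in range(2000):
    H = np.maximum(X @ W.T, 0.0)    # hidden activations, (n, m)
    err = (H @ a / m - y) / n       # scaled residuals, (n,)
    grad_a = H.T @ err / m
    grad_W = ((err[:, None] * (H > 0.0) * a).T @ X) / m
    a -= lr * grad_a
    W -= lr * grad_W

print(align_init, alignment(W))     # alignment typically grows above the 1/d baseline
```

The alignment measure starting near 1/d and increasing during training is a crude proxy for "learning the informative direction"; a kernel method with a rotation-invariant kernel treats all d coordinates identically throughout.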

Forward feature learning / backward feature correction

In the paper Backward Feature Correction: How Deep Learning Performs Deep Learning, Allen-Zhu and Li show, for ResNet-like NNs, how the hierarchy of features is learnt via a progressive correction mechanism during SGD.

This result might be connected to the observation of Malach and Shalev-Shwartz that, at least in a fractal toy model, GD finds a good solution if shallower networks are already good [Is Deeper Better Only When Shallow Is Good?, video].

Feature purification

In the paper Feature Purification: How Adversarial Training Performs Robust Deep Learning, Allen-Zhu and Li consider data modeled by a sparse decomposition over a (hidden) dictionary. They show how adversarial training leads to a purification of the learned features, yielding a sparser (and more robust) representation of the data.

This paper builds on a series of three previous papers on Tensor Programs:

Tensor programs I

Tensor programs II

Tensor programs III