Selected papers and talks on Deep Learning Theory

Deep learning theory has made good progress in recent years. Below is a short, personal selection of papers and talks that give some theoretical understanding of deep learning (with a focus on feature learning). Many thanks to Lenaïc Chizat for many of the references.

Neural tangent kernel

In wide neural networks with standard initialization, the behavior of the neural network (NN) is well understood and corresponds to a kernel regression. Let us denote by \(\theta_0\) the initial parameters and by \(f(\theta,x)\) the NN output with parameters \(\theta\). It has been shown that, when trained with gradient descent (GD), the infinitely wide NN with standard initialization behaves like a kernel regression with kernel $$K(x,y)=\langle \nabla_\theta f(\theta_0,x), \nabla_\theta f(\theta_0,y) \rangle.$$ The kernel \(K(x,y)\) is called the neural tangent kernel (NTK).

The NTK limit is nicely explained in these [Short video, Long video], and in this [paper].
A more detailed presentation can be found in this [paper].
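
To make the definition of \(K(x,y)\) concrete, here is a minimal NumPy sketch of the finite-width (empirical) kernel at initialization. The architecture (1-hidden layer, tanh activation, output scaled by \(1/\sqrt{m}\)) and all function names are my own assumptions for illustration; they are not taken from the papers above.

```python
import numpy as np

def init_params(d, m, rng):
    """Standard Gaussian initialization theta_0 = (W, a) of a 1-hidden layer NN."""
    return rng.standard_normal((m, d)), rng.standard_normal(m)

def f(W, a, x):
    """NN output f(theta, x) = a . tanh(W x) / sqrt(m) (assumed architecture)."""
    return a @ np.tanh(W @ x) / np.sqrt(W.shape[0])

def grad_f(W, a, x):
    """Gradient of f(theta, x) with respect to theta = (W, a), flattened."""
    m = W.shape[0]
    h = np.tanh(W @ x)
    dW = np.outer(a * (1.0 - h**2), x) / np.sqrt(m)  # derivative w.r.t. W
    da = h / np.sqrt(m)                               # derivative w.r.t. a
    return np.concatenate([dW.ravel(), da])

def empirical_ntk(W, a, X):
    """Gram matrix K[i, j] = <grad f(theta_0, x_i), grad f(theta_0, x_j)>."""
    J = np.stack([grad_f(W, a, x) for x in X])  # one gradient per row
    return J @ J.T

rng = np.random.default_rng(0)
W0, a0 = init_params(d=3, m=512, rng=rng)
X = rng.standard_normal((5, 3))
K = empirical_ntk(W0, a0, X)  # 5 x 5 Gram matrix, symmetric by construction
```

The matrix `K` can then be plugged into any kernel regression solver. At finite width it is only an approximation of the limiting NTK, and it depends on the random draw of \(\theta_0\).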

Feature learning in wide 1-hidden layer neural networks

Feature learning is considered one of the major ingredients of the success of deep learning. In the NTK regime mentioned above, no feature learning occurs. This suggests that, in the wide asymptotic regime, the scaling of the standard initialization is not appropriate. Instead, the scaling corresponding to the hydrodynamic limit allows for feature learning.

1-hidden layer NNs in the hydrodynamic regime have been intensively investigated recently, and some interesting feature learning phenomena have been exhibited.
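
The contrast between the two scalings can be illustrated numerically. The sketch below is a toy experiment under my own assumptions: a ReLU 1-hidden layer NN, output weights kept fixed, squared loss, a target depending on a single input direction, output scaled by \(1/\sqrt{m}\) with an \(O(1)\) step size for the NTK regime, and output scaled by \(1/m\) with a step size multiplied by \(m\) for the hydrodynamic (mean-field) regime. The function name and all constants are hypothetical. It measures how far the hidden weights travel from their initialization.

```python
import numpy as np

def hidden_weight_movement(m, scaling, steps=100, lr=0.05, seed=0):
    """Relative change ||W_T - W_0||_F / ||W_0||_F after GD on the hidden weights."""
    rng = np.random.default_rng(seed)
    n, d = 50, 5
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
    y = X[:, 0]                                    # toy target: one informative direction
    W0 = rng.standard_normal((m, d))
    a = rng.standard_normal(m)                     # output weights, kept fixed (assumption)
    if scaling == "ntk":
        scale, step = 1.0 / np.sqrt(m), lr         # output / sqrt(m), O(1) step size
    else:
        scale, step = 1.0 / m, lr * m              # output / m, step size scaled by m
    W = W0.copy()
    for _ in range(steps):
        pre = X @ W.T                                   # pre-activations, shape (n, m)
        resid = scale * (np.maximum(pre, 0.0) @ a) - y  # residuals, shape (n,)
        # gradient of 0.5 * mean(resid**2) with respect to W, shape (m, d)
        G = scale * ((resid[:, None] * (pre > 0)) * a).T @ X / n
        W -= step * G
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

ntk_move = hidden_weight_movement(4000, "ntk")
mf_move = hidden_weight_movement(4000, "mean_field")
print(f"NTK scaling: {ntk_move:.4f}, mean-field scaling: {mf_move:.4f}")
```

With these choices, the hidden weights should stay close to their initial values under the NTK scaling and move much more under the mean-field scaling. This is only a finite-width illustration of the absence or presence of feature learning, not a proof.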

Feature learning in wide 1-hidden layer NN [video]
In the paper Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss, Chizat and Bach analyse the evolution of the gradient flow corresponding to the limit of GD with infinitesimal gradient steps. In the considered setting, they essentially show that, after a transient kernel regime, the NN converges to a max-margin classifier on a certain functional space. This can be interpreted as a max-margin classifier on some learned features. From a statistical perspective, it follows that the NN classifier adapts (at least) to the intrinsic dimension of the problem.

Wide 1-hidden layer NN learns the informative directions [video]
In the same direction, in the paper When Do Neural Networks Outperform Kernel Methods?, Ghorbani et al. show that, in contrast to kernel methods, a wide 1-hidden layer NN is able to learn the informative directions in the data, and thereby avoid the curse of dimensionality.

Deep is better even in linear models

When the activation function is the identity \(\sigma(x)=x\), neural networks reduce to a simple linear model \(f\big(\theta=(W_1,\ldots,W_L),x\big)=W_1\cdots W_L \,x\). Yet, GD on this model can lead to interesting solutions. For example, Noam Razin and Nadav Cohen show in Implicit Regularization in Deep Learning May Not Be Explainable by Norms that, in a simple matrix completion problem, GD implicitly minimizes the effective rank of the solution [video].
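
As a small illustration of the objects involved, the sketch below implements the deep linear model \(W_1\cdots W_L\,x\), its collapsed end-to-end matrix, and an effective rank. I assume here the entropy-based definition of Roy and Vetterli (the exponential of the Shannon entropy of the normalized singular values); the paper should be consulted for the exact notion it uses. The function names are mine.

```python
import numpy as np
from functools import reduce

def deep_linear(weights, x):
    """f(theta, x) = W_1 ... W_L x, so W_L is applied to x first."""
    for W in reversed(weights):
        x = W @ x
    return x

def end_to_end(weights):
    """Collapse theta = (W_1, ..., W_L) into the single matrix W_1 ... W_L."""
    return reduce(np.matmul, weights)

def effective_rank(M, eps=1e-12):
    """exp(entropy of normalized singular values); assumes M is nonzero."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > eps]  # drop numerically zero singular values (0 * log 0 = 0)
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 4)) for _ in range(3)]
x = rng.standard_normal(4)
same = np.allclose(deep_linear(weights, x), end_to_end(weights) @ x)

rank_one = np.outer(rng.standard_normal(4), rng.standard_normal(4))
print(same, effective_rank(np.eye(4)), effective_rank(rank_one))
```

The depth does not change the set of functions that can be represented, only the parametrization on which GD operates. Unlike the usual rank, the effective rank is continuous in the singular values, so it can be tracked during training.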

Hierarchical learning of the features and purification

In a series of two papers, Allen-Zhu and Li investigate how features are learned and what impact adversarial training has on the learned features. The results of these two papers are presented in this [video].

Forward feature learning / backward feature correction
In the paper Backward Feature Correction: How Deep Learning Performs Deep Learning, they show, for ResNet-like NNs, how the hierarchy of features is learned via a progressive correction mechanism during SGD.
This result might be connected to the observation of Malach and Shalev-Shwartz that, at least in a fractal toy model, GD will find a good solution if shallow networks are already good [Is deeper better only when shallow is good?, video].

Feature purification
In the paper Feature Purification: How Adversarial Training Performs Robust Deep Learning, Allen-Zhu and Li consider data modeled by a sparse decomposition on a (hidden) dictionary. They show how adversarial training leads to a purification of the learned features, yielding a sparser (and more robust) representation of the data.

Tensor program

Greg Yang and Edward Hu describe in Feature Learning in Infinite-Width Neural Networks some possible non-degenerate limits of wide deep NNs. They exhibit some scalings under which feature learning occurs, and they explain how the limit distribution can be computed with the tensor program technique. More precisely, in the wide limit with Gaussian random initialization, every activation vector of the NN has iid coordinates at any time during training, with a distribution that is recursively computable (in principle at least). [video]

This paper builds on a series of 3 previous papers on tensor programs:
Tensor programs I
Tensor programs II
Tensor programs III