Mathematics of Double Descent
Abstract
Recently, there has been an increase in literature on the Double Descent phenomenon for heavily over-parameterized models. Double Descent refers to the shape of the test risk curve, which can show a second descent in the over-parameterized regime, resulting in the remarkable combination of both low training and low test risk. However, much is still unknown about this behaviour. In this thesis we consider Double Descent and, more specifically, 'beneficial overfitting', meaning that the lowest test risk as a function of the number of parameters is achieved in the over-parameterized regime. We are mainly interested in the conditions under which beneficial overfitting occurs. We start by exploring the test risk behaviour of simple linear regression models with isotropic Gaussian, general Gaussian and sub-Gaussian covariates. For random feature selection and isotropic covariance, beneficial overfitting occurs for a large signal-to-noise ratio. For deterministic feature selection and isotropic covariance, beneficial overfitting occurs if we select the features corresponding to the lowest weights. Without feature selection, beneficial overfitting occurs if the eigenvalues of the covariance matrix have a long, flat tail. In the second part of this thesis we investigate whether the same or similar results apply to other models as well. Specifically, we look at kernel regression, random Fourier features and a classification model. The linear regression results appear to agree with the random Fourier features model, with linear and quadratic kernel regression, and with the classification model, but they do not apply to the Gaussian kernel regression case. Hence, more factors need to be considered, beyond the eigenvalue behaviour of the covariance or kernel matrix and the way in which features are selected, to fully explain Double Descent and beneficial overfitting.
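
To make the setting concrete, the following is a minimal sketch (not taken from the thesis) of the kind of experiment the abstract describes: minimum-norm least squares with random feature selection and isotropic Gaussian covariates, where the test risk is tracked as the number of selected features p crosses the interpolation threshold p = n. All parameter values (d, n, noise level) are illustrative assumptions.

```python
# Sketch of a double-descent test-risk curve for minimum-norm least squares
# with random feature selection and isotropic Gaussian covariates.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test, sigma = 100, 40, 2000, 0.5          # total features, train size, test size, noise std
w_star = rng.normal(size=d) / np.sqrt(d)          # true weights (fixed signal strength)

X_tr = rng.normal(size=(n, d))                    # isotropic Gaussian covariates
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_star + sigma * rng.normal(size=n)
y_te = X_te @ w_star + sigma * rng.normal(size=n_test)

for p in [5, 10, 20, 30, 38, 40, 42, 50, 70, 100]:
    idx = rng.choice(d, size=p, replace=False)    # random feature selection
    # Minimum-norm least squares: OLS for p < n, minimum-l2-norm interpolator for p >= n.
    w_hat = np.linalg.pinv(X_tr[:, idx]) @ y_tr
    test_risk = np.mean((X_te[:, idx] @ w_hat - y_te) ** 2)
    print(f"p = {p:3d}   test risk = {test_risk:.3f}")
```

Under these assumptions the printed risk typically rises sharply as p approaches n and then descends again for p > n; whether the over-parameterized minimum lies below the under-parameterized one depends on the signal-to-noise ratio, in line with the result stated above for random feature selection and isotropic covariance.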