A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks

Cao, Yuan; Gu, Quanquan

arXiv:1902.01384v2 (cs)

[Submitted on 4 Feb 2019 (v1), revised 15 Feb 2019 (this version, v2), latest version 27 Nov 2019 (v4)]

Abstract: Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. While a line of recent work explains in theory that gradient-based methods with proper random initialization can find the global minima of the training loss in over-parameterized DNNs, it does not explain the good generalization performance of gradient-based methods for learning over-parameterized DNNs. In this work, we take a step further and prove that, under a certain assumption on the data distribution that is milder than linear separability, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error (i.e., population error). This leads to an algorithm-dependent generalization error bound for deep learning. To the best of our knowledge, this is the first result of its kind that can explain the good generalization performance of over-parameterized deep neural networks learned by gradient descent.

Comments:	54 pages. This version improves the sample complexity result so that it almost does not depend on the number of nodes per layer (only has a logarithmic dependence)
Subjects:	Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Cite as:	arXiv:1902.01384 [cs.LG]
	(or arXiv:1902.01384v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1902.01384

Submission history

From: Quanquan Gu
[v1] Mon, 4 Feb 2019 18:52:43 UTC (47 KB)
[v2] Fri, 15 Feb 2019 18:57:24 UTC (48 KB)
[v3] Tue, 2 Apr 2019 17:57:59 UTC (50 KB)
[v4] Wed, 27 Nov 2019 07:08:38 UTC (31 KB)
