Estimating Information Flow in Deep Neural Networks

Goldfeld, Ziv; Berg, Ewout van den; Greenewald, Kristjan; Melnyk, Igor; Nguyen, Nam; Kingsbury, Brian; Polyanskiy, Yury

arXiv:1810.05728 (cs)

[Submitted on 12 Oct 2018 (v1), last revised 30 May 2019 (this version, v4)]

Abstract: We study the flow of information and the evolution of internal representations during deep neural network (DNN) training, aiming to demystify the compression aspect of the information bottleneck theory. The theory suggests that DNN training comprises a rapid fitting phase followed by a slower compression phase, in which the mutual information $I(X;T)$ between the input $X$ and internal representations $T$ decreases. Several papers observe compression of estimated mutual information on different DNN models, but the true $I(X;T)$ over these networks is provably either constant (discrete $X$) or infinite (continuous $X$). This work explains the discrepancy between theory and experiments, and clarifies what was actually measured by these past works. To this end, we introduce an auxiliary (noisy) DNN framework for which $I(X;T)$ is a meaningful quantity that depends on the network's parameters. This noisy framework is shown to be a good proxy for the original (deterministic) DNN both in terms of performance and the learned representations. We then develop a rigorous estimator for $I(X;T)$ in noisy DNNs and observe compression in various models. By relating $I(X;T)$ in the noisy DNN to an information-theoretic communication problem, we show that compression is driven by the progressive clustering of hidden representations of inputs from the same class. Several methods to directly monitor clustering of hidden representations, both in noisy and deterministic DNNs, are used to show that meaningful clusters form in the $T$ space. Finally, we return to the estimator of $I(X;T)$ employed in past works, and demonstrate that while it fails to capture the true (vacuous) mutual information, it does serve as a measure for clustering. This clarifies the past observations of compression and isolates the geometric clustering of hidden representations as the true phenomenon of interest.
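The link the abstract draws between clustering and compression in a noisy DNN can be illustrated with a small numerical sketch. This is not the authors' estimator or code. It assumes a single noisy layer $T = f(X) + Z$ with isotropic Gaussian noise $Z \sim \mathcal{N}(0, \sigma^2 I)$ and $X$ uniform over $n$ inputs, so that $I(X;T) = h(T) - h(Z)$, where $T$ follows an $n$-component Gaussian mixture. The function name `noisy_layer_mi` and the parameters `centers`, `sigma`, and `num_mc` are hypothetical names chosen for this illustration.

```python
import numpy as np

def noisy_layer_mi(centers, sigma, num_mc=2000, seed=0):
    """Monte Carlo estimate (in nats) of I(X;T) for T = f(X) + Z.

    Illustrative assumptions (not taken from the paper's code):
    X is uniform over n inputs, centers[i] = f(x_i), Z ~ N(0, sigma^2 I).
    """
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers, dtype=float)
    n, d = centers.shape
    # Sample T: pick a center uniformly at random, then add Gaussian noise.
    idx = rng.integers(0, n, size=num_mc)
    t = centers[idx] + sigma * rng.standard_normal((num_mc, d))
    # Log-density of each mixture component at each sample, shape (num_mc, n).
    sq = ((t[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    log_comp = -sq / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    # Numerically stable log of the uniform mixture density.
    m = log_comp.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1)) - np.log(n)
    h_t = -log_mix.mean()  # Monte Carlo estimate of h(T)
    h_z = 0.5 * d * np.log(2 * np.pi * np.e * sigma**2)  # exact h(Z)
    return h_t - h_z

# Well-separated representations versus fully collapsed ones.
spread = np.arange(8)[:, None] * np.array([[100.0, 0.0]])
collapsed = np.zeros((8, 2))
mi_spread = noisy_layer_mi(spread, sigma=1.0)
mi_collapsed = noisy_layer_mi(collapsed, sigma=1.0)
```

Under these assumptions, `mi_spread` should come out close to $\log 8 \approx 2.08$ nats, since the noise barely blurs eight far-apart representations, while `mi_collapsed` should be close to zero, since identical representations carry no information about which input produced them. Intermediate degrees of clustering fall between the two, which is the sense in which tighter clusters correspond to lower $I(X;T)$.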

Comments:	Main text accepted to ICML 2019. This preprint contains the full version of that paper (including omitted appendices)
Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1810.05728 [cs.LG]
	(or arXiv:1810.05728v4 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1810.05728

Submission history

From: Kristjan Greenewald
[v1] Fri, 12 Oct 2018 21:11:30 UTC (18,016 KB)
[v2] Tue, 16 Oct 2018 02:52:45 UTC (18,016 KB)
[v3] Wed, 14 Nov 2018 16:38:23 UTC (18,017 KB)
[v4] Thu, 30 May 2019 15:42:19 UTC (32,799 KB)
