Pareto-optimal data compression for binary classification tasks

Tegmark, Max; Wu, Tailin

doi:10.3390/e22010007


arXiv:1908.08961 (cs)

[Submitted on 23 Aug 2019 (v1), last revised 15 Jan 2020 (this version, v2)]

Title: Pareto-optimal data compression for binary classification tasks

Authors: Max Tegmark (MIT), Tailin Wu (MIT)


Abstract: The goal of lossy data compression is to reduce the storage cost of a data set $X$ while retaining as much information as possible about something ($Y$) that you care about. For example, what aspects of an image $X$ contain the most information about whether it depicts a cat? Mathematically, this corresponds to finding a mapping $X\to Z\equiv f(X)$ that maximizes the mutual information $I(Z,Y)$ while the entropy $H(Z)$ is kept below some fixed threshold. We present a method for mapping out the Pareto frontier for classification tasks, reflecting the tradeoff between retained entropy and class information. We first show how a random variable $X$ (an image, say) drawn from a class $Y\in\{1,...,n\}$ can be distilled into a vector $W=f(X)\in \mathbb{R}^{n-1}$ losslessly, so that $I(W,Y)=I(X,Y)$; for example, for a binary classification task of cats and dogs, each image $X$ is mapped into a single real number $W$ retaining all information that helps distinguish cats from dogs. For the $n=2$ case of binary classification, we then show how $W$ can be further compressed into a discrete variable $Z=g_\beta(W)\in\{1,...,m_\beta\}$ by binning $W$ into $m_\beta$ bins, in such a way that varying the parameter $\beta$ sweeps out the full Pareto frontier, solving a generalization of the Discrete Information Bottleneck (DIB) problem. We argue that the most interesting points on this frontier are "corners" maximizing $I(Z,Y)$ for a fixed number of bins $m=2,3,\ldots$, which can conveniently be found without multiobjective optimization. We apply this method to the CIFAR-10, MNIST and Fashion-MNIST datasets, illustrating how it can be interpreted as an information-theoretically optimal image clustering algorithm.
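As a rough illustration of the binning step described in the abstract, the Python sketch below estimates $H(Z)$ and $I(Z,Y)$ for a binned scalar $W$ and then searches for the best single boundary in the $m=2$ case. It is not the paper's algorithm: the abstract does not spell out how the boundaries are found, so a brute-force grid search stands in for it. The choice $W=P(Y=1|X)$, the synthetic Gaussian data, and the function names `entropy_bits`, `binned_information` and `best_two_bin_split` are all assumptions made for this example.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def binned_information(w, y, edges):
    """Return empirical (H(Z), I(Z,Y)) in bits, where Z is the bin index of w."""
    z = np.digitize(w, edges)               # bin index in {0, ..., len(edges)}
    joint = np.zeros((len(edges) + 1, 2))
    np.add.at(joint, (z, y), 1.0)           # empirical joint counts of (Z, Y)
    joint /= joint.sum()
    h_z = entropy_bits(joint.sum(axis=1))
    h_y = entropy_bits(joint.sum(axis=0))
    h_zy = entropy_bits(joint.ravel())
    return h_z, h_z + h_y - h_zy            # I(Z,Y) = H(Z) + H(Y) - H(Z,Y)

def best_two_bin_split(w, y, candidates):
    """Grid search (an assumed stand-in) for the boundary maximizing I(Z,Y), m = 2."""
    scores = [binned_information(w, y, [c])[1] for c in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])

# Hypothetical synthetic task: X ~ N(+1, 1) if Y = 1, X ~ N(-1, 1) if Y = 0.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=20000)
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

# Assumed distillation: with equal priors the log-likelihood ratio is 2x,
# so P(Y=1 | X=x) = sigmoid(2x) is a scalar W in (0, 1).
w = 1.0 / (1.0 + np.exp(-2.0 * x))

split, info2 = best_two_bin_split(w, y, np.linspace(0.05, 0.95, 19))
h4, info4 = binned_information(w, y, [0.25, 0.5, 0.75])
print(f"m=2: boundary={split:.2f}, I(Z,Y)={info2:.3f} bits")
print(f"m=4: H(Z)={h4:.3f} bits, I(Z,Y)={info4:.3f} bits")
```

Refining the bins can only raise the empirical $I(Z,Y)$, at the price of a larger $H(Z)$; this is the tradeoff that the Pareto frontier in the abstract describes.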

Comments:	Replaced to match version published in Entropy. 17 pages, 9 figs; improved discussion, comparison with Blahut-Arimoto method
Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (stat.ML)
Cite as:	arXiv:1908.08961 [cs.LG]
	(or arXiv:1908.08961v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1908.08961
Journal reference:	Entropy (2020), 22, 7
Related DOI:	https://doi.org/10.3390/e22010007

Submission history

From: Max Tegmark
[v1] Fri, 23 Aug 2019 18:00:40 UTC (5,073 KB)
[v2] Wed, 15 Jan 2020 18:43:57 UTC (5,189 KB)
