Dependency Leakage: Analysis and Scalable Estimators

Barnes, Matt; Dubrawski, Artur

arXiv:1807.06713v1 (stat)

[Submitted on 18 Jul 2018 (this version), latest version 28 Dec 2018 (v2)]

Authors: Matt Barnes, Artur Dubrawski

Abstract: In this paper, we prove the first theoretical results on dependency leakage -- a phenomenon in which learning on noisy clusters biases cross-validation and model selection results. This is a major concern for domains involving human record databases (e.g., medical, census, advertising), which are almost always noisy due to the effects of record linkage and which require special attention to machine learning bias. The proposed theoretical properties justify regularization choices in several existing statistical estimators and allow us to construct the first hypothesis test for cross-validation bias due to dependency leakage. Furthermore, we propose a novel matrix sketching technique that, combined with standard function approximation techniques, dramatically improves the sample and computational scalability of existing estimators. Empirical results on several benchmark datasets validate our theoretical results and proposed methods.
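To make the phenomenon in the abstract concrete, the following is a minimal illustrative sketch, not the authors' estimator or hypothesis test. It assumes a synthetic dataset in which every entity appears as two near-duplicate records and labels are independent of the features, so any honest accuracy estimate should sit near 0.5. All names (`make_records`, `one_nn_predict`, `cv_accuracy`) and parameter values are hypothetical. The sketch also assumes the true entity IDs are known when forming the grouped split; the paper concerns the harder case where the clusters themselves are noisy.

```python
import random

def make_records(n_entities=200, copies=2, dim=5, noise=0.01, seed=0):
    """Synthetic records: each entity gets a random label that is independent
    of its features and appears `copies` times with tiny feature noise."""
    rng = random.Random(seed)
    records = []  # each record is (entity_id, features, label)
    for entity in range(n_entities):
        center = [rng.gauss(0, 1) for _ in range(dim)]
        label = rng.randint(0, 1)
        for _ in range(copies):
            feats = [c + rng.gauss(0, noise) for c in center]
            records.append((entity, feats, label))
    return records

def one_nn_predict(train, feats):
    """Return the label of the nearest training record (squared Euclidean)."""
    nearest = min(
        train,
        key=lambda r: sum((a - b) ** 2 for a, b in zip(r[1], feats)),
    )
    return nearest[2]

def cv_accuracy(records, fold_of, k):
    """k-fold accuracy of 1-NN; fold_of[i] is the fold of record i."""
    correct = 0
    for fold in range(k):
        train = [r for i, r in enumerate(records) if fold_of[i] != fold]
        test = [r for i, r in enumerate(records) if fold_of[i] == fold]
        correct += sum(one_nn_predict(train, r[1]) == r[2] for r in test)
    return correct / len(records)

k = 5
records = make_records()
rng = random.Random(1)

# Naive split: folds assigned per record, so duplicates of one entity
# usually end up on both sides of the train/test boundary.
naive_folds = [rng.randrange(k) for _ in records]

# Grouped split: folds assigned per entity, so all records of an entity
# stay in the same fold.
entity_fold = {e: rng.randrange(k) for e in {r[0] for r in records}}
grouped_folds = [entity_fold[r[0]] for r in records]

naive_acc = cv_accuracy(records, naive_folds, k)
grouped_acc = cv_accuracy(records, grouped_folds, k)
print(f"naive CV accuracy:   {naive_acc:.2f}")
print(f"grouped CV accuracy: {grouped_acc:.2f}")
```

Under these assumptions the naive estimate is inflated because the classifier simply retrieves a test record's duplicate from the training fold, while the grouped estimate stays near chance. When the entity assignment is itself the output of imperfect record linkage, some duplicates would still straddle the split, which is the setting the abstract describes.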

Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:1807.06713 [stat.ML]
	(or arXiv:1807.06713v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1807.06713

Submission history

From: Matt Barnes
[v1] Wed, 18 Jul 2018 00:13:31 UTC (195 KB)
[v2] Fri, 28 Dec 2018 23:20:23 UTC (84 KB)
