Recognizing Variables from their Data via Deep Embeddings of Distributions

Mueller, Jonas; Smola, Alex

arXiv:1909.04844 (cs)

[Submitted on 11 Sep 2019]

Abstract: A key obstacle in automated analytics and meta-learning is the inability to recognize when different datasets contain measurements of the same variable. Because provided attribute labels are often uninformative in practice, this task may be addressed more robustly by leveraging the data values themselves rather than relying only on their arbitrarily selected variable names. Here, we present a computationally efficient method to identify high-confidence variable matches between a given set of data values and a large repository of previously encountered datasets. Our approach enjoys numerous advantages over techniques based on distributional similarity because we leverage learned vector embeddings of datasets that adaptively account for natural forms of data variation encountered in practice. Based on the neural architecture of deep sets, our embeddings can be computed for both numeric and string data. In dataset search and schema matching tasks, our methods outperform standard statistical techniques, and we find that the learned embeddings generalize well to new data sources.
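
The general idea of embedding a column of values with a deep-sets-style network and matching it against a repository can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random and untrained (the paper learns them), only numeric data is handled, and the names `make_encoder`, `embed`, and `best_match` are hypothetical. It shows only the structural property that a per-value map followed by mean pooling yields an embedding independent of the order of the values.

```python
import numpy as np


def make_encoder(hidden_dim=16, embed_dim=8, seed=0):
    """Build a toy deep-sets-style encoder with random, untrained weights."""
    rng = np.random.default_rng(seed)
    w_phi = rng.normal(size=(1, hidden_dim))
    b_phi = rng.normal(size=hidden_dim)
    w_rho = rng.normal(size=(hidden_dim, embed_dim))

    def embed(values):
        x = np.asarray(values, dtype=float).reshape(-1, 1)
        h = np.tanh(x @ w_phi + b_phi)  # phi: applied to each value independently
        pooled = h.mean(axis=0)         # mean pooling ignores the order of values
        z = np.tanh(pooled @ w_rho)     # rho: maps pooled features to the embedding
        return z / np.linalg.norm(z)    # unit length, so dot product = cosine

    return embed


def best_match(query, repository, embed):
    """Return (name, cosine similarity) of the closest repository column."""
    q = embed(query)
    scores = {name: float(q @ embed(vals)) for name, vals in repository.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    embed = make_encoder()
    # Hypothetical repository of previously seen columns (made-up values).
    repository = {"a": [0.1, 0.5, 0.9], "b": [-2.0, 3.0, 0.0]}
    print(best_match([0.9, 0.1, 0.5], repository, embed))
```

Because the encoder here is untrained, the similarity scores carry no semantic meaning beyond the identical-data case; in the setting the abstract describes, the embedding would be learned so that columns measuring the same variable land close together.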

Comments:	IEEE International Conference on Data Mining (ICDM), 2019
Subjects:	Machine Learning (cs.LG); Databases (cs.DB); Machine Learning (stat.ML)
Cite as:	arXiv:1909.04844 [cs.LG]
	(or arXiv:1909.04844v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1909.04844

Submission history

From: Jonas Mueller
[v1] Wed, 11 Sep 2019 04:10:48 UTC (158 KB)
