MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage

He, Yongjun; Waleffe, Roger; Han, Zhichao; George, Johnu; Yuan, Binhang; Zhang, Zitao; Shan, Yinan; Zhao, Yang; Dutta, Debojyoti; Rekatsinas, Theodoros; Zhang, Ce

arXiv:2504.01506 (cs)

[Submitted on 2 Apr 2025]

Abstract: Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications utilize embedding models and embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks targeted to train large embedding models for specific tasks. Although the scalability issues that arise in different embedding model training tasks are similar, each of these frameworks independently reinvents and customizes storage components for specific tasks, leading to substantial duplicated engineering efforts in both development and deployment. This paper presents MLKV, an efficient, extensible, and reusable data storage framework designed to address the scalability challenges in embedding model training, specifically data stall and staleness. MLKV augments disk-based key-value storage by democratizing optimizations that were previously exclusive to individual specialized frameworks and provides easy-to-use interfaces for embedding model training tasks. Extensive experiments on open-source workloads, as well as applications in eBay's payment transaction risk detection and seller payment risk detection, show that MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x. MLKV is open-source.

Comments:	To appear in ICDE 2025
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2504.01506 [cs.LG]
	(or arXiv:2504.01506v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2504.01506

Submission history

From: Yongjun He
[v1] Wed, 2 Apr 2025 08:57:01 UTC (619 KB)
