RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure

Zhao, Mark; Choudhary, Dhruv; Tyagi, Devashish; Somani, Ajay; Kaplan, Max; Lin, Sung-Han; Pumma, Sarunya; Park, Jongsoo; Basant, Aarti; Agarwal, Niket; Wu, Carole-Jean; Kozyrakis, Christos

arXiv:2211.05239 (cs)

[Submitted on 9 Nov 2022 (v1), last revised 1 May 2023 (this version, v4)]

Abstract: We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features' values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
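
The per-batch deduplication idea described in the abstract can be illustrated with a small sketch. This is not the IKJT implementation from the paper or from any library; it only assumes that one feature's per-sample values form a jagged list, and it stores each distinct value list once together with an inverse index that maps every sample back to its unique row. The names `dedup_feature`, `expand`, and `pooled_sum` are hypothetical and introduced purely for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def dedup_feature(rows: Sequence[Sequence[int]]) -> Tuple[List[List[int]], List[int]]:
    """Deduplicate one feature's per-sample value lists.

    Returns (unique_rows, inverse) such that unique_rows[inverse[i]] == rows[i].
    Hypothetical helper, not the paper's or any library's API.
    """
    slot_of: Dict[Tuple[int, ...], int] = {}
    unique_rows: List[List[int]] = []
    inverse: List[int] = []
    for row in rows:
        key = tuple(row)
        slot = slot_of.get(key)
        if slot is None:
            # First time this value list appears in the batch: store it once.
            slot = len(unique_rows)
            slot_of[key] = slot
            unique_rows.append(list(row))
        inverse.append(slot)
    return unique_rows, inverse


def expand(unique_rows: Sequence[Sequence[int]], inverse: Sequence[int]) -> List[List[int]]:
    """Rebuild the original per-sample rows from the deduplicated form."""
    return [list(unique_rows[i]) for i in inverse]


def pooled_sum(rows: Sequence[Sequence[int]]) -> List[int]:
    """Stand-in for a per-row computation (assumption: a simple sum pooling)."""
    return [sum(row) for row in rows]


# Four samples; three come from the same session and share this feature's values.
batch = [[10, 11, 12], [10, 11, 12], [7], [10, 11, 12]]
unique_rows, inverse = dedup_feature(batch)

# Compute once per unique row, then gather results back to per-sample order.
pooled_unique = pooled_sum(unique_rows)
pooled_per_sample = [pooled_unique[i] for i in inverse]
```

In this sketch the per-row computation runs on two unique rows instead of four, and the gather through `inverse` restores one result per sample. The more samples in a batch share a session, the fewer unique rows remain, which is consistent with the abstract's point about maximizing duplication within a training batch.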

Comments:	Published in the Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys 2023)
Subjects:	Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Performance (cs.PF)
Cite as:	arXiv:2211.05239 [cs.LG]
	(or arXiv:2211.05239v4 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2211.05239

Submission history

From: Mark Zhao
[v1] Wed, 9 Nov 2022 22:21:19 UTC (776 KB)
[v2] Mon, 14 Nov 2022 22:07:19 UTC (776 KB)
[v3] Wed, 26 Apr 2023 00:58:57 UTC (781 KB)
[v4] Mon, 1 May 2023 19:37:39 UTC (781 KB)
