Saturn: An Optimized Data System for Large Model Deep Learning Workloads

Nagrecha, Kabir; Kumar, Arun

arXiv:2309.01226 (cs)

[Submitted on 3 Sep 2023 (v1), last revised 13 Dec 2023 (this version, v2)]

Abstract: Large language models such as GPT-3 & ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of "model parallelism" techniques & tools. Navigating such parallelism choices, however, is a new burden for end users of DL such as data scientists, domain scientists, etc., who may lack the necessary systems know-how. The need for model selection, which leads to many models to train due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we tackle these three burdens for DL users in a unified manner by formalizing them as a joint problem that we call SPASE: Select a Parallelism, Allocate resources, and SchedulE. We propose a new information system architecture to tackle the SPASE problem holistically, representing a key step toward enabling wider adoption of large DL models. We devise an extensible template for existing parallelism schemes and combine it with an automated empirical profiler for runtime estimation. We then formulate SPASE as an MILP. We find that direct use of an MILP solver is significantly more effective than several baseline heuristics. We optimize the system runtime further with an introspective scheduling approach. We implement all these techniques into a new data system we call Saturn. Experiments with benchmark DL workloads show that Saturn achieves 39-49% lower model selection runtimes than typical current DL practice.

Comments:	Accepted at VLDB '24. Code available: this https URL. 12 pages + 3 pages references + 2 pages appendix
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2309.01226 [cs.LG]
	(or arXiv:2309.01226v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2309.01226

Submission history

From: Kabir Nagrecha
[v1] Sun, 3 Sep 2023 17:19:11 UTC (2,543 KB)
[v2] Wed, 13 Dec 2023 18:42:58 UTC (14,242 KB)
