Experimentally Evaluating the Resource Efficiency of Big Data Autoscaling

Will, Jonathan; Treide, Nico; Thamsen, Lauritz; Kao, Odej

doi:10.1109/BigData62323.2024.10825367

arXiv:2501.14456 (cs)

[Submitted on 24 Jan 2025]

Abstract: Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual resource allocations, such as memory and CPU cores, must meet the specific resource requirements of the job. An alternative to selecting a static resource allocation for a job execution is autoscaling, as implemented, for example, by Spark.
In this paper, we evaluate the resource efficiency of autoscaling batch data processing jobs based on resource demand both conceptually and experimentally by analyzing a new dataset of Spark job executions on Google Dataproc Serverless. In our experimental evaluation, we show that there is no significant resource efficiency gain over static resource allocations. We found that the inherent conceptual limitations of such autoscaling approaches are the inelasticity of node size as well as the inelasticity of the ratio of memory to CPU cores.
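
The two limitations named above can be illustrated with a small calculation. The following is a minimal sketch under stated assumptions, not the paper's method or data: it assumes an autoscaler that can only change the number of executors, each with a fixed shape (cores and memory). The names ExecutorShape, executors_needed, and utilization, as well as all numbers, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorShape:
    """Fixed executor size; the autoscaler can only change how many are running."""
    cores: int
    memory_gb: float


def executors_needed(shape, demand_cores, demand_memory_gb):
    """Smallest executor count that covers both the CPU and the memory demand."""
    by_cores = math.ceil(demand_cores / shape.cores)
    by_memory = math.ceil(demand_memory_gb / shape.memory_gb)
    return max(1, by_cores, by_memory)


def utilization(shape, demand_cores, demand_memory_gb):
    """Return (executor count, CPU utilization, memory utilization)."""
    n = executors_needed(shape, demand_cores, demand_memory_gb)
    cpu_util = demand_cores / (n * shape.cores)
    mem_util = demand_memory_gb / (n * shape.memory_gb)
    return n, cpu_util, mem_util


# Hypothetical memory-heavy stage on executors with 4 cores and 16 GB each.
shape = ExecutorShape(cores=4, memory_gb=16.0)
n, cpu_util, mem_util = utilization(shape, demand_cores=6, demand_memory_gb=96.0)
print(n, cpu_util, mem_util)
```

In this hypothetical case the memory demand dictates six executors, so 24 cores are allocated for a demand of 6: the fixed memory-to-core ratio leaves three quarters of the CPU capacity idead of being released. The same function also shows the node-size effect: a demand of 5 cores and 10 GB still requires two whole executors, because capacity can only be added in executor-sized steps.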

Comments:	6 pages, 3 figures, 6 tables. IEEE Big Data 2024
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
ACM classes:	C.2.4; I.2.8; I.2.6
Cite as:	arXiv:2501.14456 [cs.DC]
	(or arXiv:2501.14456v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2501.14456
Related DOI:	https://doi.org/10.1109/BigData62323.2024.10825367

Submission history

From: Jonathan Will
[v1] Fri, 24 Jan 2025 12:38:25 UTC (56 KB)
