To Store or Not to Store: a graph theoretical approach for Dataset Versioning

Guo, Anxin; Li, Jingwei; Sukprasert, Pattara; Khuller, Samir; Deshpande, Amol; Mukherjee, Koyel

Abstract:In this work, we study the cost efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph of datasets as nodes and edges capturing edit/delta information. One central variant we study is MinSum Retrieval (MSR) where the goal is to minimize the total retrieval costs, while keeping the storage costs bounded. This problem (along with its variants) was introduced by Bhattacherjee et al. [VLDB'15]. While such problems are frequently encountered in collaborative tools (e.g., version control systems and data analysis pipelines), to the best of our knowledge, no existing research studies the theoretical aspects of these problems.
We establish that the currently best-known heuristic, LMG, can perform arbitrarily badly in a simple worst case. Moreover, we show that it is hard to get $o(n)$-approximation for MSR on general graphs even if we relax the storage constraints by an $O(\log n)$ factor. Similar hardness results are shown for other variants. Meanwhile, we propose poly-time approximation schemes for tree-like graphs, motivated by the fact that the graphs arising in practice from typical edit operations are often not arbitrary. As version graphs typically have low treewidth, we further develop new algorithms for bounded treewidth graphs.
Furthermore, we propose two new heuristics and evaluate them empirically. First, we extend LMG by considering more potential ``moves'', to propose a new heuristic LMG-All. LMG-All consistently outperforms LMG while having comparable run time on a wide variety of datasets, i.e., version graphs. Secondly, we apply our tree algorithms on the minimum-storage arborescence of an instance, yielding algorithms that are qualitatively better than all previous heuristics for MSR, as well as for another variant BoundedMin Retrieval (BMR).

Comments:	Accepted by IPDPS 2024
Subjects:	Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2402.11741 [cs.DS]
	(or arXiv:2402.11741v1 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2402.11741

Computer Science > Data Structures and Algorithms

Title:To Store or Not to Store: a graph theoretical approach for Dataset Versioning

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators