High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

Li, Yifan; Guidi, Giulia

doi:10.1145/3673038.3673072


arXiv:2407.07718 (cs)

[Submitted on 10 Jul 2024]


Abstract: In generating large quantities of DNA data, high-throughput sequencing technologies require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of counting how often each DNA subsequence of fixed length k occurs, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. Due to the growing volume of data, scaling the counting process is critical. In the literature, distributed memory software uses hash tables, which exhibit poor cache friendliness and consume excessive memory. Such software often also lacks support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. Furthermore, we introduce an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK achieves up to 2x speedup while reducing peak memory usage by 30% on 16 nodes. Finally, we integrated HySortK into an existing genome assembly pipeline and achieved up to 1.8x speedup, demonstrating its flexibility and practicality in real-world scenarios.
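
As a rough illustration of what "sorting-based" k-mer counting means, the sketch below extracts every k-mer from a set of reads, sorts them, and counts runs of equal neighbors instead of inserting into a hash table. It is a single-process toy under stated assumptions, not HySortK's implementation: the function name `count_kmers_by_sorting` and the sample reads are hypothetical, and it ignores distributed communication, hybrid parallelism, and reverse-complement (canonical) k-mers.

```python
from itertools import groupby


def count_kmers_by_sorting(reads, k):
    """Count k-mers by extracting them, sorting, and scanning runs.

    Hypothetical helper for illustration only; not taken from HySortK.
    """
    # Slide a window of length k over every read; reads shorter than k
    # contribute nothing because the range is empty.
    kmers = [read[i:i + k] for read in reads for i in range(len(read) - k + 1)]

    # After sorting, identical k-mers are adjacent in memory, so counting
    # becomes a sequential scan rather than random hash-table accesses.
    kmers.sort()

    # Each run of equal k-mers yields one (k-mer, count) pair.
    return [(kmer, sum(1 for _ in group)) for kmer, group in groupby(kmers)]


# Made-up sample reads for illustration.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers_by_sorting(reads, 3)
for kmer, count in counts:
    print(kmer, count)
```

The sequential scan after the sort is what gives sorting-based counters their cache-friendly access pattern, in contrast to the hash-table approach the abstract criticizes.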

Comments:	10 pages
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Genomics (q-bio.GN)
Cite as:	arXiv:2407.07718 [cs.DC]
	(or arXiv:2407.07718v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2407.07718
Journal reference:	In The 53rd International Conference on Parallel Processing (ICPP 24), August 12-15, 2024, Gotland, Sweden
Related DOI:	https://doi.org/10.1145/3673038.3673072

Submission history

From: Giulia Guidi
[v1] Wed, 10 Jul 2024 14:50:00 UTC (782 KB)
