SparseAssembler2: Sparse k-mer Graph for Memory Efficient Genome Assembly

Ye, Chengxi; Cannon, Charles H.; Ma, Zhanshan Sam; Yu, Douglas W.; Pop, Mihai

arXiv:1108.3556 (cs)

[Submitted on 17 Aug 2011 (v1), last revised 9 Jan 2013 (this version, v2)]

Abstract: The formal version of our work has been published in BMC Bioinformatics and can be found here: this http URL

Motivation: De Bruijn graph-based algorithms, on which some of the most widely used de novo genome assemblers are built, use a huge amount of memory. To tackle this problem, we released SparseAssembler1. SparseAssembler1 can reduce memory consumption by as much as 90% compared with state-of-the-art assemblers, but it requires rounds of denoising to assemble genomes accurately. In this paper, we introduce a new general model for genome assembly that uses only sparse k-mers. The new model replaces the de Bruijn graph from the outset, and achieves similar memory efficiency and much better robustness than our previous SparseAssembler1.

Results: We demonstrate that decomposing reads into all overlapping k-mers, as existing de Bruijn graph genome assemblers do, is overly cautious. We introduce a sparse k-mer graph structure for storing sparse k-mers, which greatly reduces the memory required for de novo genome assembly. In contrast with the de Bruijn graph approach, we devise a simple but powerful strategy: find links between the k-mers in the genome and traverse along those links, which can be done while saving only a few k-mers. To implement the strategy, we need only select some k-mers, which may not even overlap, and build the links between these k-mers indicated by the reads. We can then traverse this sparse k-mer graph to build the contigs, and ultimately complete the genome assembly. Since the new sparse k-mer graph shares almost all the advantages of the de Bruijn graph, we are able to adapt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms.
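The strategy described in the abstract (select a few k-mers per read, link them as the reads indicate, then follow the links) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `build_sparse_kmer_graph` and `extend_contig`, the skip parameter `g`, and the toy genome and reads are all assumptions introduced for illustration. The sketch also ignores reverse complements, sequencing errors, and the Dijkstra-like search mentioned above.

```python
from collections import defaultdict


def build_sparse_kmer_graph(reads, k, g):
    """Keep roughly every g-th k-mer of each read and link neighbours.

    Hypothetical helper: `nodes` is the set of selected k-mers, and
    `links[a][(bases, b)]` counts the reads in which k-mer `b` follows
    k-mer `a` after appending `bases` (g characters).
    """
    nodes = set()
    links = defaultdict(lambda: defaultdict(int))
    for read in reads:
        n_kmers = len(read) - k + 1
        if n_kmers <= 0:
            continue
        # Start from an already-selected k-mer when one lies within the
        # first g positions, so overlapping reads sample the same k-mers.
        start = 0
        for i in range(min(g, n_kmers)):
            if read[i:i + k] in nodes:
                start = i
                break
        prev = None
        for pos in range(start, n_kmers, g):
            kmer = read[pos:pos + k]
            nodes.add(kmer)
            if prev is not None:
                # The g bases that extend the previous k-mer to this one.
                bases = read[pos - g + k:pos + k]
                links[prev][(bases, kmer)] += 1
            prev = kmer
    return nodes, links


def extend_contig(seed, links):
    """Follow unambiguous links from `seed`, spelling out a contig."""
    contig, current, visited = seed, seed, {seed}
    while len(links.get(current, {})) == 1:
        (bases, nxt), = links[current].keys()
        if nxt in visited:  # stop on cycles
            break
        contig += bases
        visited.add(nxt)
        current = nxt
    return contig


# Toy example (assumed data): two overlapping error-free reads.
genome = "ACGTTGCATGGCCTAAGCTT"
reads = [genome[0:14], genome[6:20]]
nodes, links = build_sparse_kmer_graph(reads, k=5, g=3)
print(len(nodes), extend_contig("ACGTT", links))
```

With `k=5` and `g=3` the toy example stores 6 k-mers rather than the 16 overlapping 5-mers a de Bruijn graph of the same sequence would hold, which is the source of the memory saving the abstract refers to.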

Comments:	Corresponding authors: Zhanshan (Sam) Ma, ma@vandals.this http URL; Mihai Pop, mpop@umiacs.this http URL || Availability: Programs for both Windows and Linux are available at: this https URL
Subjects:	Data Structures and Algorithms (cs.DS); Genomics (q-bio.GN)
Cite as:	arXiv:1108.3556 [cs.DS]
	(or arXiv:1108.3556v2 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.1108.3556

Submission history

From: Chengxi Ye
[v1] Wed, 17 Aug 2011 19:24:45 UTC (167 KB)
[v2] Wed, 9 Jan 2013 19:12:17 UTC (167 KB)
