SPRISS: Approximating Frequent $k$-mers by Sampling Reads, and Applications

Santoro, Diego; Pellegrina, Leonardo; Vandin, Fabio

Abstract: The extraction of $k$-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. Extracting all $k$-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of $k$-mers to be considered. However, several applications require only the frequent $k$-mers, that is, the $k$-mers appearing in a relatively high proportion of the data. In this work we present SPRISS, a new efficient algorithm to approximate frequent $k$-mers and their frequencies in next-generation sequencing data. SPRISS employs a simple yet powerful read-sampling scheme that extracts a representative subset of the dataset. This subset can be used, in combination with any $k$-mer counting algorithm, to perform downstream analyses in a fraction of the time required to analyze the whole dataset, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent $k$-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets and the identification of discriminative $k$-mers, to extract insights in a fraction of the time required by the analysis of the whole dataset.
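
The general idea described in the abstract (sample reads, then run any $k$-mer counter on the sample) can be illustrated with a minimal Python sketch. This is not the SPRISS implementation: the function names, the uniform sampling with replacement, the definition of frequency as occurrences divided by the total number of $k$-mer positions, and the reuse of the same threshold `theta` on the sample are all assumptions made for illustration, and the sketch carries none of the approximation guarantees of the paper.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_reads(reads, sample_size, seed=0):
    """Draw reads uniformly at random, with replacement (an assumption)."""
    rng = random.Random(seed)
    return [rng.choice(reads) for _ in range(sample_size)]


def approx_frequent_kmers(reads, k, theta, sample_size, seed=0):
    """Estimate k-mers whose frequency is at least theta from a read sample.

    Frequency here means occurrences divided by the total number of
    k-mer positions in the sample (an assumed definition).
    """
    sample = sample_reads(reads, sample_size, seed)
    counts = count_kmers(sample, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        kmer: c / total
        for kmer, c in counts.items()
        if c / total >= theta
    }


if __name__ == "__main__":
    # Toy dataset: ten identical reads, so any sample gives the same answer.
    toy_reads = ["ACGTACGT"] * 10
    print(approx_frequent_kmers(toy_reads, k=4, theta=0.3, sample_size=5))
```

Any exact counter could replace `count_kmers` here, since the sampling step is independent of how the sample is counted; choosing the sample size and threshold so that the output is provably close to the true frequent $k$-mers is the part this sketch leaves out.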

Comments: Accepted to RECOMB 2021
Subjects: Quantitative Methods (q-bio.QM)
Cite as: arXiv:2101.07117 [q-bio.QM] (or arXiv:2101.07117v1 [q-bio.QM] for this version)
DOI: https://doi.org/10.48550/arXiv.2101.07117
