Align-gram : Rethinking the Skip-gram Model for Protein Sequence Analysis

Ibtehaz, Nabil; Sourav, S. M. Shakhawat Hossain; Bayzid, Md. Shamsuzzoha; Rahman, M. Sohel

arXiv:2012.03324 (q-bio)

[Submitted on 6 Dec 2020]

Abstract: Background: The advent of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the `language of life', have been analyzed for a multitude of applications and inferences.
Motivation: Owing to the rapid development of deep learning, there have been a number of breakthroughs in Natural Language Processing in recent years. Since these methods can perform different tasks when trained with a sufficient amount of data, off-the-shelf models are often applied to various biological problems. In this study, we investigate the applicability of the popular Skip-gram model to protein sequence analysis and attempt to incorporate some biological insights into it.
Results: We propose a novel $k$-mer embedding scheme, Align-gram, which is capable of mapping similar $k$-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram are better suited to modeling and training deep learning models. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
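
For readers unfamiliar with the setup, the following is a minimal sketch of how a protein sequence is typically split into overlapping $k$-mers and turned into (center, context) training pairs for a standard Skip-gram model. It illustrates only the baseline Skip-gram preprocessing that the abstract refers to, not the Align-gram scheme itself, whose details are not given here. The function names, the toy sequence, and the values of `k` and `window` are assumptions chosen for illustration.

```python
from typing import List, Tuple


def kmers(seq: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Emit (center, context) pairs, as used to train a standard Skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy amino-acid sequence (hypothetical, for illustration only).
tokens = kmers("MKTAYIAK", k=3)
pairs = skipgram_pairs(tokens, window=1)
```

In such a setup each $k$-mer plays the role of a word, and the pairs would be fed to any Skip-gram trainer to learn one vector per $k$-mer.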

Subjects:	Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Genomics (q-bio.GN)
Cite as:	arXiv:2012.03324 [q-bio.QM]
	(or arXiv:2012.03324v1 [q-bio.QM] for this version)
	https://doi.org/10.48550/arXiv.2012.03324

Submission history

From: Nabil Ibtehaz
[v1] Sun, 6 Dec 2020 17:04:17 UTC (310 KB)
