Linguistically inspired roadmap for building biologically reliable protein language models

Vu, Mai Ha; Akbar, Rahmad; Robert, Philippe A.; Swiatczak, Bartlomiej; Greiff, Victor; Sandve, Geir Kjetil; Haug, Dag Trygve Truslew

doi:10.1038/s42256-023-00637-1

arXiv:2207.00982 (q-bio)

[Submitted on 3 Jul 2022 (v1), last revised 28 Apr 2023 (this version, v2)]

Abstract: Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs compared to natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Incorporating linguistic ideas into protein LMs enables the development of next-generation interpretable machine-learning models with the potential of uncovering the biological mechanisms underlying sequence-function relationships.

Comments:	27 pages, 4 figures
Subjects:	Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Cite as:	arXiv:2207.00982 [q-bio.QM]
	(or arXiv:2207.00982v2 [q-bio.QM] for this version)
	https://doi.org/10.48550/arXiv.2207.00982
Journal reference:	Nat Mach Intell (2023)
Related DOI:	https://doi.org/10.1038/s42256-023-00637-1

Submission history

From: Mai Ha Vu [view email]
[v1] Sun, 3 Jul 2022 08:42:44 UTC (1,891 KB)
[v2] Fri, 28 Apr 2023 15:33:39 UTC (747 KB)
