ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing

Lv, Liuzhenghao; Lin, Zongying; Li, Hao; Liu, Yuyang; Cui, Jiaxi; Chen, Calvin Yu-Chian; Yuan, Li; Tian, Yonghong

Computer Science > Computational Engineering, Finance, and Science

arXiv:2402.16445v1 (cs)

[Submitted on 26 Feb 2024 (this version), latest version 16 Jul 2024 (v2)]

Abstract: Large Language Models (LLMs), including GPT-x and LLaMA2, have achieved remarkable performance in multiple Natural Language Processing (NLP) tasks. Under the premise that protein sequences constitute the protein language, Protein Large Language Models (ProLLMs) trained on protein corpora excel at de novo protein sequence generation. However, as of now, unlike LLMs in NLP, no ProLLM is capable of multiple tasks in the Protein Language Processing (PLP) field. This prompts us to delineate the inherent limitations in current ProLLMs: (i) the lack of natural language capabilities, (ii) insufficient instruction understanding, and (iii) high training resource demands. To address these challenges, we introduce a training framework to transform any general LLM into a ProLLM capable of handling multiple PLP tasks. Specifically, our framework utilizes low-rank adaptation and employs a two-stage training approach, and it is distinguished by its universality, low overhead, and scalability. Through training under this framework, we propose the ProLLaMA model, the first known ProLLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results in the unconditional protein sequence generation task. In the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. In the protein property prediction task, ProLLaMA achieves nearly 100% accuracy across many categories. The latter two tasks are beyond the reach of other ProLLMs. Code is available at this https URL.
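
Illustrative note: the abstract names low-rank adaptation as the source of the framework's low overhead. The sketch below shows only the generic low-rank adaptation arithmetic (a frozen weight W plus a scaled low-rank update B A), not the authors' implementation. The function names, matrix sizes, rank, and scaling factor alpha are all assumptions chosen for illustration; none of them come from the paper.

```python
# Generic low-rank adaptation (LoRA) arithmetic; NOT ProLLaMA's code.
# All names and sizes here are hypothetical, chosen only for illustration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight after adaptation."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_fraction(d_out, d_in, rank):
    """Share of parameters trained when only B (d_out x r) and A (r x d_in) are updated."""
    return rank * (d_out + d_in) / (d_out * d_in)

# Tiny example: a frozen 2x3 weight with a rank-1 update.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen pretrained weight
B = [[1.0], [2.0]]                        # d_out x r, trainable
A = [[0.5, 0.0, 1.0]]                     # r x d_in, trainable
W_eff = lora_effective_weight(W, B, A, alpha=2.0, rank=1)
```

For a hypothetical 4096 x 4096 layer adapted at rank 8, `trainable_fraction(4096, 4096, 8)` comes to roughly 0.4% of that layer's parameters, which is the kind of saving that low-rank adaptation is generally used for.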

Subjects:	Computational Engineering, Finance, and Science (cs.CE); Biomolecules (q-bio.BM)
Cite as:	arXiv:2402.16445 [cs.CE]
	(or arXiv:2402.16445v1 [cs.CE] for this version)
	https://doi.org/10.48550/arXiv.2402.16445

Submission history

From: Liuzhenghao Lv
[v1] Mon, 26 Feb 2024 09:43:52 UTC (3,084 KB)
[v2] Tue, 16 Jul 2024 10:35:34 UTC (5,423 KB)
