FoldToken: Learning Protein Language via Vector Quantization and Beyond

Gao, Zhangyang; Tan, Cheng; Wang, Jue; Huang, Yufei; Wu, Lirong; Li, Stan Z.

arXiv:2403.09673 (q-bio)

[Submitted on 4 Feb 2024 (v1), last revised 19 Mar 2024 (this version, v2)]

Abstract: Is there a foreign language describing protein sequences and structures simultaneously? Protein structures, represented by continuous 3D points, have long posed a challenge due to the contrasting modeling paradigms of discrete sequences. We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols. This innovative approach involves projecting residue types and structures into a discrete space, guided by a reconstruction loss for information preservation. We refer to the learned discrete symbols as FoldToken, and the sequence of FoldTokens serves as a new protein language, transforming the protein sequence-structure into a unified modality. We apply the created protein language on general backbone inpainting and antibody design tasks, building the first GPT-style model (FoldGPT) for sequence-structure co-generation with promising results. Key to our success is the substantial enhancement of the vector quantization module, Soft Conditional Vector Quantization (SoftCVQ).
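
The abstract describes projecting continuous features into a discrete space under a reconstruction loss. As a rough illustration of that general idea only, the following is a minimal sketch of plain nearest-neighbor vector quantization. It is not the authors' SoftCVQ or FoldTokenizer; the codebook, the toy 2D embeddings, and the function names `quantize`, `reconstruct`, and `reconstruction_loss` are all hypothetical, chosen for illustration.

```python
import math

def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    tokens = []
    for v in vectors:
        dists = [math.dist(v, c) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def reconstruct(tokens, codebook):
    """Look up the codebook vector for each discrete token."""
    return [codebook[t] for t in tokens]

def reconstruction_loss(vectors, recon):
    """Mean squared error between the original vectors and their reconstruction."""
    total = sum((a - b) ** 2 for v, r in zip(vectors, recon) for a, b in zip(v, r))
    count = sum(len(v) for v in vectors)
    return total / count

# Hypothetical toy data: a 3-entry codebook and four 2D "residue embeddings".
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
embeddings = [(0.1, -0.1), (0.9, 1.2), (1.8, 0.1), (1.1, 0.9)]

tokens = quantize(embeddings, codebook)          # discrete symbols
recon = reconstruct(tokens, codebook)            # decoded approximation
loss = reconstruction_loss(embeddings, recon)    # information lost by discretizing
```

In this sketch the token sequence plays the role of the discrete "language", and the reconstruction loss measures how much information the discretization discards. A learned tokenizer would train the encoder and codebook to reduce that loss, which this fixed-codebook example does not attempt.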

Subjects:	Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2403.09673 [q-bio.BM]
	(or arXiv:2403.09673v2 [q-bio.BM] for this version)
	https://doi.org/10.48550/arXiv.2403.09673

Submission history

From: Zhangyang Gao
[v1] Sun, 4 Feb 2024 12:18:51 UTC (19,191 KB)
[v2] Tue, 19 Mar 2024 05:29:23 UTC (19,191 KB)
