Can linguists better understand DNA?

Liang, Wang

Abstract: Multilingual transfer ability, which reflects how well models fine-tuned on one source language can be applied to other languages, has been well studied in multilingual pre-trained models. However, whether such capability transfer exists between natural language and gene sequences/languages remains underexplored. This study addresses this gap by drawing inspiration from the sentence-pair classification task used for evaluating sentence similarity in natural language. We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination). These tasks were designed to validate the transferability of capabilities from natural language to gene sequences. Even a small-scale pre-trained model like GPT-2-small, which was pre-trained on English, achieved an accuracy of 78% on the DNA-pair classification task after being fine-tuned on English sentence-pair classification data (XTREME PAWS-X). With a BERT model trained on multilingual text, the precision reached 89%. On the more complex DNA-protein-pair classification task, however, the model's output was barely distinguishable from random guessing. This validation has confirmed that the transfer of capabilities from natural language to biological language is unequivocally present. Building on this foundation, we have also investigated the impact of model parameter scale and pre-training on this capability transfer. We provide recommendations for facilitating the transfer of capabilities from natural language to genetic language, as well as new approaches for conducting biological research based on this transfer. This study offers an intriguing new perspective on exploring the relationship between natural language and genetic language.
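The setup described in the abstract treats a DNA pair the same way as a sentence pair: two strings joined into one input, with a binary label. The following is a minimal sketch of that shared interface only, not the authors' code. The names `PairExample`, `format_pair`, `overlap_baseline`, and the `" [SEP] "` separator are all hypothetical, the sequences are made-up toy data, and the position-overlap rule is a stand-in for a fine-tuned model such as GPT-2-small or BERT.

```python
from dataclasses import dataclass
from typing import Callable, List

SEP = " [SEP] "  # assumed separator; real tokenizers define their own


@dataclass
class PairExample:
    first: str
    second: str
    label: int  # 1 = similar pair, 0 = dissimilar pair


def format_pair(first: str, second: str) -> str:
    """Join two strings into a single pair-classification input."""
    return f"{first}{SEP}{second}"


def overlap_baseline(text: str, threshold: float = 0.8) -> int:
    """Toy stand-in for a fine-tuned classifier (NOT the paper's model).

    Predicts 1 when the fraction of identical positions reaches the threshold.
    """
    first, second = text.split(SEP)
    matches = sum(a == b for a, b in zip(first, second))
    return int(matches / max(len(first), len(second)) >= threshold)


def accuracy(predict: Callable[[str], int], examples: List[PairExample]) -> float:
    """Fraction of pairs whose predicted label equals the gold label."""
    correct = sum(
        predict(format_pair(ex.first, ex.second)) == ex.label for ex in examples
    )
    return correct / len(examples)


# Made-up DNA pairs in the same shape as sentence-pair data.
dna_pairs = [
    PairExample("ATGCGTACGTTAGC", "ATGCGTACGTTAGC", 1),  # identical
    PairExample("ATGCGTACGTTAGC", "ATGCGTACGATAGC", 1),  # one substitution
    PairExample("ATGCGTACGTTAGC", "GGCATTCAAGCGTA", 0),  # unrelated
]

if __name__ == "__main__":
    print(accuracy(overlap_baseline, dna_pairs))
```

In a transfer experiment of the kind the abstract describes, the `predict` callable would be a model fine-tuned on English sentence pairs, and the evaluation loop would stay unchanged when the inputs switch from sentences to DNA sequences.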

Comments:	20 pages, 8 figures
Subjects:	Computation and Language (cs.CL); Genomics (q-bio.GN)
MSC classes:	92-10
ACM classes:	J.3
Cite as:	arXiv:2412.07678 [cs.CL]
	(or arXiv:2412.07678v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.07678
