scReader: Prompting Large Language Models to Interpret scRNA-seq Data

Li, Cong; Long, Qingqing; Zhou, Yuanchun; Xiao, Meng

Abstract: Large language models (LLMs) have advanced remarkably, primarily because of their ability to model hidden relationships within text sequences. This presents a unique opportunity in the life sciences, where vast collections of single-cell omics data from multiple species provide a basis for training foundation models. However, data scales differ widely across species, which hinders the development of a comprehensive model for interpreting genetic data across diverse organisms. In this study, we propose a hybrid approach that integrates the general knowledge capabilities of LLMs with domain-specific representation models for single-cell omics data interpretation. We take genes as the fundamental unit of representation. Gene representations are initialized from functional descriptions, leveraging mature language models such as LLaMA-2. By inputting single-cell gene-level expression data together with prompts, we model cellular representations based on the differential expression levels of genes across species and cell types. In the experiments, we constructed datasets of developmental cells from humans and mice, specifically targeting cells that are challenging to annotate. We evaluated our method on basic tasks such as cell annotation and visualization analysis. The results demonstrate the efficacy of our approach compared with other LLM-based methods, with significant improvements in accuracy and interoperability. Our hybrid approach enhances the representation of single-cell data and offers a robust framework for future research in cross-species genetic analysis.
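
The idea of building a cell representation from gene representations and expression levels can be illustrated with a toy sketch. This is not the scReader model: the abstract does not specify the architecture, so the sketch assumes a simple expression-weighted mean of gene embeddings followed by nearest-reference annotation. The gene names, the 3-dimensional vectors (hypothetical stand-ins for LLM embeddings of functional descriptions), and the helper functions `cell_embedding` and `annotate` are all illustrative assumptions.

```python
import math

# Hypothetical stand-ins for LLM-derived gene embeddings. A real pipeline
# would embed each gene's functional description with a language model;
# these tiny 3-d vectors are hard-coded purely for illustration.
GENE_EMBEDDINGS = {
    "GeneA": [1.0, 0.0, 0.0],
    "GeneB": [0.0, 1.0, 0.0],
    "GeneC": [0.0, 0.0, 1.0],
}


def cell_embedding(expression):
    """Expression-weighted mean of gene embeddings for one cell (assumed scheme)."""
    total = sum(expression.values())
    if total <= 0:
        raise ValueError("cell has no expressed genes")
    dim = len(next(iter(GENE_EMBEDDINGS.values())))
    vec = [0.0] * dim
    for gene, level in expression.items():
        emb = GENE_EMBEDDINGS[gene]
        for i in range(dim):
            vec[i] += level * emb[i] / total
    return vec


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def annotate(expression, references):
    """Return the reference label whose embedding is most similar to the cell."""
    query = cell_embedding(expression)
    return max(references, key=lambda label: cosine(query, references[label]))


# Two made-up reference cell types, each defined by an expression profile.
references = {
    "type_1": cell_embedding({"GeneA": 9.0, "GeneB": 1.0, "GeneC": 0.0}),
    "type_2": cell_embedding({"GeneA": 0.0, "GeneB": 2.0, "GeneC": 8.0}),
}
label = annotate({"GeneA": 5.0, "GeneB": 1.0, "GeneC": 1.0}, references)
print(label)
```

Because the query cell is dominated by `GeneA`, it is assigned to the reference profile that is also dominated by `GeneA`. A learned model would replace both the fixed weighting and the nearest-reference rule.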

Comments:	8 pages, Accepted by ICDM 2024
Subjects:	Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2412.18156 [q-bio.GN]
	(or arXiv:2412.18156v1 [q-bio.GN] for this version)
	https://doi.org/10.48550/arXiv.2412.18156

