Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training

Cai, Xunxin; Wang, Chengrui; Long, Qingqing; Zhou, Yuanchun; Xiao, Meng

Abstract: The rapid advancement of large language models (LLMs) in biological-medical applications has highlighted a gap between their potential and the limited scale and often low quality of available open-source annotated textual datasets. In addition, the inherent complexity of the biomedical knowledge hierarchy significantly hampers efforts to bridge this gap. Can LLMs themselves play a pivotal role in overcoming this limitation? Motivated by this question, we investigate this challenge in the present study. We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature. Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain, guided by the biomedical knowledge hierarchy through Medical Subject Headings (MeSH). The framework establishes a fully automated workflow, eliminating the need for manual intervention. Furthermore, we conducted comprehensive experiments to evaluate the impact of our framework-generated data on downstream language models of varying sizes. Our approach substantially improves performance on question-answering tasks compared with pre-trained models from the life sciences domain and powerful closed-source models represented by GPT-4. Notably, the generated AI-Ready dataset enabled the Llama3-70B base model to outperform GPT-4 with MedPrompt, a model with several times as many parameters. Detailed case studies and ablation experiments underscore the significance of each component within our framework.

Comments:	16 pages, accepted by DASFAA 2025
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.15108 [cs.CL]
	(or arXiv:2501.15108v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.15108
