A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

Xie, Yunfei; Wu, Juncheng; Tu, Haoqin; Yang, Siwei; Zhao, Bingchen; Zong, Yongshuo; Jin, Qiao; Xie, Cihang; Zhou, Yuyin

Abstract:Large language models (LLMs) have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI's o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance compared to standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios. But meanwhile, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs at this https URL for future research.

Comments:	The first four authors contributed equally, project page available at this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2409.15277 [cs.CL]
	(or arXiv:2409.15277v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2409.15277

Computer Science > Computation and Language

Title:A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators