Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback

Hein, Dennis; Chen, Zhihong; Ostmeier, Sophie; Xu, Justin; Varma, Maya; Reis, Eduardo Pontes; Michalson, Arne Edward; Bluethgen, Christian; Shin, Hyun Joo; Langlotz, Curtis; Chaudhari, Akshay S

arXiv:2410.07025 (cs)

[Submitted on 9 Oct 2024]

Abstract: Radiologists play a crucial role by translating medical images into medical reports. However, the field faces staffing shortages and increasing workloads. While automated approaches using vision-language models (VLMs) show promise as assistants, they require exceptionally high accuracy. Most current VLMs in radiology rely solely on supervised fine-tuning (SFT). Meanwhile, in the general domain, additional preference fine-tuning has become standard practice. The challenge in radiology lies in the prohibitive cost of obtaining radiologist feedback. We propose a scalable automated preference alignment technique for VLMs in radiology, focusing on chest X-ray (CXR) report generation. Our method leverages publicly available datasets with an LLM-as-a-Judge mechanism, eliminating the need for additional expert radiologist feedback. We evaluate and benchmark five direct alignment algorithms (DAAs). Our results show up to a 57.4% improvement in average GREEN scores, an LLM-based metric for evaluating CXR reports, and a 9.2% increase in an average across six metrics (domain-specific and general), compared to the SFT baseline. We study reward overoptimization via length exploitation, with reports lengthening by up to 3.2x. To assess a potential alignment tax, we benchmark on six additional diverse tasks, finding no significant degradations. A reader study involving four board-certified radiologists indicates win rates of up to 0.62 over the SFT baseline, while significantly penalizing verbosity. Our analysis provides actionable insights for the development of VLMs in high-stakes fields like radiology.
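
The pipeline the abstract describes (score candidate reports with an automated judge, turn the scores into chosen/rejected pairs, then optimize a direct alignment objective) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_judge` is a hypothetical word-overlap stand-in for an LLM judge, `build_preference_pair` is a hypothetical helper, and the DPO loss is shown only as one widely used direct alignment algorithm; the abstract does not name the five DAAs or specify how pairs are built.

```python
import math
from typing import Callable, List, Optional, Tuple


def toy_judge(report: str, reference: str) -> float:
    """Hypothetical stand-in for an LLM judge: Jaccard word overlap with a reference report."""
    a, b = set(report.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def build_preference_pair(
    candidates: List[str],
    judge: Callable[[str], float],
    min_margin: float = 0.0,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from judge scores, or None if the scores are too close."""
    if len(candidates) < 2:
        return None
    # Score each candidate once, then take the lowest- and highest-scoring reports.
    scored = sorted(((judge(c), c) for c in candidates), key=lambda t: t[0])
    (low_score, rejected), (high_score, chosen) = scored[0], scored[-1]
    if high_score - low_score <= min_margin:
        return None
    return chosen, rejected


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log(sigmoid(margin)), from sequence log-probabilities."""
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


reference = "small left pleural effusion"
pair = build_preference_pair(
    ["no acute findings", "small left pleural effusion"],
    judge=lambda r: toy_judge(r, reference),
)
```

Because a judge of this kind can be gamed, for example by longer outputs, tracking report length alongside the judged score is a natural check when using such a setup; the abstract reports lengthening of up to 3.2x.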

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2410.07025 [cs.CV]
	(or arXiv:2410.07025v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.07025

Submission history

From: Dennis Hein
[v1] Wed, 9 Oct 2024 16:07:11 UTC (4,241 KB)
