Multi-Modal Hallucination Control by Visual Information Grounding

Favero, Alessandro; Zancato, Luca; Trager, Matthew; Choudhary, Siddharth; Perera, Pramuditha; Achille, Alessandro; Swaminathan, Ashwin; Soatto, Stefano

arXiv:2403.14003 (cs)

[Submitted on 20 Mar 2024]

Abstract: Generative Vision-Language Models (VLMs) are prone to generating plausible-sounding textual answers that are not always grounded in the input image. We investigate this phenomenon, usually referred to as "hallucination", and show that it stems from an excessive reliance on the language prior. In particular, we show that as more tokens are generated, the reliance on the visual prompt decreases, and this behavior strongly correlates with the emergence of hallucinations. To reduce hallucinations, we introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for prompt amplification. M3ID amplifies the influence of the reference image over the language prior, thereby favoring the generation of tokens with higher mutual information with the visual prompt. M3ID can be applied to any pre-trained autoregressive VLM at inference time without further training and with minimal computational overhead. If training is an option, we show that M3ID can be paired with Direct Preference Optimization (DPO) to improve the model's reliance on the prompt image without requiring any labels. Our empirical findings show that our algorithms maintain the fluency and linguistic capabilities of pre-trained VLMs while reducing hallucinations by mitigating visually ungrounded answers. Specifically, for the LLaVA 13B model, M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve the accuracy on VQA benchmarks such as POPE by 21% and 24%.
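
To make the idea of favoring tokens with higher mutual information with the visual prompt concrete, here is a minimal sketch. It is not the paper's exact decoding rule, which the abstract does not spell out. It assumes we can obtain next-token logits from the same model twice, once with the image in the prompt and once without, and it boosts each token by its pointwise mutual information with the image. The function names, the weight `alpha`, and the toy three-token vocabulary are all hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def image_amplified_log_probs(cond_logits, uncond_logits, alpha=1.0):
    """Illustrative re-scoring, assumed for this sketch only.

    cond_logits:   next-token logits with the image in the prompt
    uncond_logits: next-token logits from the text prompt alone
    alpha:         hypothetical weight on the image-specific evidence
    """
    lc = log_softmax(cond_logits)
    lu = log_softmax(uncond_logits)
    # log p(y | image, text) - log p(y | text) is the pointwise mutual
    # information between token y and the image, given the text so far.
    scores = [c + alpha * (c - u) for c, u in zip(lc, lu)]
    return log_softmax(scores)  # renormalize to a valid distribution

# Toy vocabulary (hypothetical): index 0 = "dog", 1 = "frisbee", 2 = "cat".
cond = [2.0, 1.5, 0.0]    # the image makes "frisbee" fairly likely
uncond = [2.0, 0.0, 1.5]  # the language prior alone prefers "cat" over "frisbee"

plain_best = max(range(len(cond)), key=cond.__getitem__)
adjusted = image_amplified_log_probs(cond, uncond, alpha=1.0)
amplified_best = max(range(len(adjusted)), key=adjusted.__getitem__)
```

In this toy case plain decoding picks the token the language prior already favors, while the adjusted scores pick the token whose probability rises most once the image is seen. Setting `alpha=0` recovers the ordinary image-conditioned distribution.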

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2403.14003 [cs.CV]
	(or arXiv:2403.14003v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.14003
Journal reference:	IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

Submission history

From: Alessandro Favero
[v1] Wed, 20 Mar 2024 22:05:18 UTC (4,037 KB)
