EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models

Villa, Andrés; Alcázar, Juan León; Alfarra, Motasem; Araujo, Vladimir; Soto, Alvaro; Ghanem, Bernard

arXiv:2501.02699 (cs)

[Submitted on 6 Jan 2025]

Abstract: Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. The fusion of these models has resulted in multi-modal architectures with enhanced instructional capabilities. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. These failure cases are known as hallucinations. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. In this paper, we address the hallucination issue by directly enhancing the capabilities of the visual component. Our approach, named EAGLE, is fully agnostic to the LLM or fusion module and works as a post-pretraining approach that improves the grounding and language alignment of the visual encoder. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without additional instructional training. As a result, EAGLE achieves a significant reduction in hallucinations across multiple challenging benchmarks and tasks.

Comments:	12 pages, 4 figures, 8 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.02699 [cs.CV]
	(or arXiv:2501.02699v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.02699

Submission history

From: Andrés Villa
[v1] Mon, 6 Jan 2025 00:39:31 UTC (32,251 KB)
