Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models

Panos, Aristeidis; Aljundi, Rahaf; Reino, Daniel Olmeda; Turner, Richard E

arXiv:2407.16526 (cs)

[Submitted on 23 Jul 2024]

Abstract: Vision-language models (VLMs) demonstrate impressive capabilities in visual question answering and image captioning, acting as a crucial link between visual and language models. However, existing open-source VLMs rely heavily on pretrained and frozen vision encoders (such as CLIP). Despite CLIP's robustness across diverse domains, it still exhibits non-negligible image understanding errors. These errors propagate to the VLM responses, resulting in sub-optimal performance. In our work, we propose an efficient and robust method for updating vision encoders within VLMs. Our approach selectively and locally updates encoders, leading to substantial performance improvements on data where previous mistakes occurred, while maintaining overall robustness. Furthermore, we demonstrate the effectiveness of our method during continual few-shot updates. Theoretical grounding, generality, and computational efficiency characterize our approach.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2407.16526 [cs.CV]
	(or arXiv:2407.16526v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.16526

Submission history

From: Rahaf Aljundi
[v1] Tue, 23 Jul 2024 14:39:40 UTC (22,021 KB)
