Diffusion Feedback Helps CLIP See Better

Wang, Wenxuan; Sun, Quan; Zhang, Fan; Tang, Yepeng; Liu, Jing; Wang, Xinlong

arXiv:2407.20171 (cs)

[Submitted on 29 Jul 2024 (v1), last revised 24 Aug 2024 (this version, v4)]

Abstract: Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings: for example, it can hardly distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, because the text lacks distinctiveness and the images lack diversity. In this work, we present a simple post-training approach for CLIP models that largely overcomes these visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, using only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance (e.g., by 3-7%) on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The code is available at this https URL.
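
The idea of using generative feedback to tune an image encoder can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a tiny linear encoder `W` standing in for CLIP's visual encoder and a fixed linear map `D` standing in for a frozen, pretrained diffusion denoiser. All names (`diffusion_feedback_step`, `W`, `D`, `alpha`, `sigma`) are hypothetical. It shows only the general pattern: noise an image, condition the frozen denoiser on features computed from that image alone (no text), and backpropagate the noise-prediction loss into the encoder.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def diffusion_feedback_step(W, D, image, lr=0.05, rng=random):
    """One gradient step on the trainable encoder W; the toy denoiser D stays frozen."""
    dim = len(image)
    # Forward diffusion at one fixed noise level (alpha^2 + sigma^2 = 1).
    alpha, sigma = 0.8, 0.6
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    noisy = [alpha * x + sigma * n for x, n in zip(image, noise)]
    # The condition is computed from the image only, with no paired text.
    cond = matvec(W, image)
    # Frozen toy denoiser: estimate the clean image from the condition,
    # then infer the noise that must have been added.
    x0_hat = matvec(D, cond)
    noise_hat = [(y - alpha * g) / sigma for y, g in zip(noisy, x0_hat)]
    # Noise-prediction (denoising) loss.
    resid = [nh - n for nh, n in zip(noise_hat, noise)]
    loss = sum(r * r for r in resid) / dim
    # Manual backpropagation into W only.
    g_x0 = [2.0 * r / dim * (-alpha / sigma) for r in resid]
    g_cond = [sum(g_x0[i] * D[i][k] for i in range(dim)) for k in range(len(cond))]
    for k in range(len(W)):
        for j in range(dim):
            W[k][j] -= lr * g_cond[k] * image[j]
    return loss

random.seed(0)
W = [[0.0, 0.0], [0.0, 0.0]]   # trainable encoder weights (assumption)
D = [[1.0, 0.0], [0.0, 1.0]]   # frozen stand-in for a pretrained denoiser (assumption)
image = [1.0, -0.5]            # a two-"pixel" toy image
losses = [diffusion_feedback_step(W, D, image) for _ in range(50)]
print(losses[0], losses[-1])
```

In this toy setting the denoising loss falls as the encoder learns features from which the frozen denoiser can recover the image, which is the sense in which the generative model provides feedback. A real system would replace both linear maps with deep networks and sample many noise levels and images.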

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.20171 [cs.CV]
	(or arXiv:2407.20171v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.20171

Submission history

From: Wenxuan Wang
[v1] Mon, 29 Jul 2024 17:00:09 UTC (2,062 KB)
[v2] Tue, 6 Aug 2024 08:42:47 UTC (2,062 KB)
[v3] Sun, 18 Aug 2024 15:15:36 UTC (2,062 KB)
[v4] Sat, 24 Aug 2024 03:55:36 UTC (2,152 KB)
