CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

Liu, Yanqing; Li, Xianhang; Wang, Zeyu; Zhao, Bingchen; Xie, Cihang

arXiv:2411.16828 (cs)

[Submitted on 25 Nov 2024]

Abstract: Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining methods such as CLIP, and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions. First, we observe a strong inverse effect in learning with synthetic captions: short synthetic captions generally lead to much higher performance than full-length ones. We therefore feed only partial synthetic captions to the text encoder. Second, we incorporate an autoregressive captioner to mimic the recaptioning process: conditioned on the paired image input and the web-crawled text description, the captioner learns to predict the full-length synthetic caption generated by advanced MLLMs. Experiments show that our framework significantly improves zero-shot performance in cross-modal retrieval tasks, setting new SOTA results on MSCOCO and Flickr30K. Moreover, vision encoders trained this way can enhance the visual capability of LLaVA, showing strong improvements on a range of MLLM benchmarks. Our project page is this https URL.
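
The two designs in the abstract can be illustrated with a minimal sketch of how the text inputs for one training sample might be prepared. This is not the authors' implementation: the function names (`split_sentences`, `partial_caption`, `make_training_texts`), the sentence-level splitting rule, and the random contiguous-window sampling are all assumptions made for illustration, since the abstract only states that partial synthetic captions go to the text encoder and that the captioner is conditioned on the image and web-crawled text to predict the full synthetic caption.

```python
import random
import re


def split_sentences(caption):
    """Split a caption into sentences on ., ! or ? followed by whitespace.

    Assumption: sentence-level granularity; the paper may truncate differently.
    """
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def partial_caption(caption, max_sentences=1, rng=None):
    """Return a contiguous window of at most `max_sentences` sentences.

    Hypothetical sampling rule: a random start position is drawn from `rng`.
    """
    rng = rng or random.Random()
    sentences = split_sentences(caption)
    if len(sentences) <= max_sentences:
        return " ".join(sentences)
    start = rng.randrange(len(sentences) - max_sentences + 1)
    return " ".join(sentences[start:start + max_sentences])


def make_training_texts(web_text, synthetic_caption, max_sentences=1, rng=None):
    """Assemble the text fields for one image (hypothetical field names).

    - contrastive_text: partial synthetic caption for the text encoder
    - captioner_condition: web-crawled text the captioner is conditioned on
      (the paired image would be supplied alongside it)
    - captioner_target: full-length synthetic caption to be predicted
    """
    return {
        "contrastive_text": partial_caption(synthetic_caption, max_sentences, rng),
        "captioner_condition": web_text,
        "captioner_target": synthetic_caption,
    }
```

For example, calling `make_training_texts("dog photo", "A dog runs on grass. The sky is blue. A child watches nearby.")` yields one of the three sentences as the contrastive text while keeping the whole caption as the captioner target. The split keeps the contrastive branch on short text while the generative branch still sees the full description, which is the division of labor the abstract describes.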

Comments:	12 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2411.16828 [cs.CV]
	(or arXiv:2411.16828v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.16828

Submission history

From: Yanqing Liu
[v1] Mon, 25 Nov 2024 18:49:02 UTC (269 KB)
