Prototypical Contrastive Language Image Pretraining

Chen, Delong; Wu, Zhao; Liu, Fan; Yang, Zaiquan; Huang, Yixiang; Bao, Yiping; Zhou, Erjin

arXiv:2206.10996v2 (cs)

[Submitted on 22 Jun 2022 (v1), revised 11 Aug 2022 (this version, v2), latest version 21 Nov 2023 (v4)]

Abstract: Contrastive Language Image Pretraining (CLIP) has received widespread attention because its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aims to align positive image-text pairs and separate negative ones. In this paper, we show a representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerging within-modal anchors. We introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. We further propose Prototypical Back Translation (PBT) to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC dataset, ProtoCLIP matches the performance of CLIP with 4$\times$ fewer pretraining epochs. Code is available at this https URL.
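For readers unfamiliar with the InfoNCE objective mentioned in the abstract, the following is a minimal pure-Python sketch of the generic symmetric image-text InfoNCE loss used in CLIP-style training. It is not the ProtoCLIP prototype-level loss or PBT, and it is not taken from the authors' code. The function names and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE over a batch (illustrative sketch).

    image_embs[i] and text_embs[i] are treated as the positive pair;
    every other image-text pairing in the batch is a negative.
    The temperature default is an assumed value, not from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: column j should select row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]],
                                 [[1.0, 0.0], [0.0, 1.0]])
    shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]],
                                  [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

In this toy batch, correctly paired embeddings yield a lower loss than mismatched ones, which is the alignment and separation behavior the abstract attributes to InfoNCE.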

Comments:	Preprint
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2206.10996 [cs.CV]
	(or arXiv:2206.10996v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.10996

Submission history

From: Delong Chen
[v1] Wed, 22 Jun 2022 11:55:53 UTC (6,554 KB)
[v2] Thu, 11 Aug 2022 05:15:15 UTC (6,743 KB)
[v3] Wed, 8 Nov 2023 03:26:43 UTC (7,289 KB)
[v4] Tue, 21 Nov 2023 04:18:38 UTC (7,290 KB)
