ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Chen, Delong; Wu, Zhao; Liu, Fan; Yang, Zaiquan; Huang, Huaxi; Tan, Ying; Zhou, Erjin

arXiv:2206.10996 (cs)

[Submitted on 22 Jun 2022 (v1), last revised 21 Nov 2023 (this version, v4)]

Abstract: Contrastive Language Image Pretraining (CLIP) has received widespread attention because its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerging within-modal anchors. Based on this understanding, we introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP), which enhances such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of the training time. Code is available at this https URL.
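To make the InfoNCE objective mentioned in the abstract concrete, the following is a minimal sketch of the standard symmetric image-text InfoNCE loss used in CLIP-style training. It is not the authors' code and does not implement ProtoCLIP's prototype-level discrimination or PBT. The function names (`l2_normalize`, `cross_entropy_diag`, `info_nce`) and the fixed temperature of 0.07 are assumptions chosen for illustration; row i of the image batch is assumed to be paired with row i of the text batch.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    temperature=0.07 is an illustrative assumption, not a value from the abstract.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

Matched pairs on the diagonal drive the loss toward zero, while every other entry in the same row or column acts as a negative. For example, `info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])` is much smaller than the same call with the two text embeddings swapped.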

Comments:	Accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2206.10996 [cs.CV]
	(or arXiv:2206.10996v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.10996

Submission history

From: Delong Chen
[v1] Wed, 22 Jun 2022 11:55:53 UTC (6,554 KB)
[v2] Thu, 11 Aug 2022 05:15:15 UTC (6,743 KB)
[v3] Wed, 8 Nov 2023 03:26:43 UTC (7,289 KB)
[v4] Tue, 21 Nov 2023 04:18:38 UTC (7,290 KB)
