Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Zhang, Le; Awal, Rabiul; Agrawal, Aishwarya

arXiv:2306.08832 (cs)

[Submitted on 15 Jun 2023 (v1), last revised 25 Apr 2024 (this version, v4)]

Abstract: Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at this https URL.
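
For orientation, the "standard image-text contrastive learning framework" mentioned in the abstract can be sketched as a symmetric InfoNCE loss over a batch of matched image-caption pairs. This is a minimal pure-Python illustration of that baseline objective only, not the authors' refined method; the function names and the temperature value of 0.07 are assumptions chosen for the example.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text InfoNCE loss (illustrative baseline, hypothetical name).

    image_embs[i] and text_embs[i] are assumed to form a matched pair;
    every other item in the batch serves as a negative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine-similarity logits, scaled by the (assumed) temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: the same computation on the transposed logits.
    cols = [list(c) for c in zip(*logits)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Usage: matched pairs score a lower loss than mismatched ones.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this baseline the only negatives are the other captions and images that happen to share the batch, which is the setting the abstract describes as failing to focus on relations, actions, and attributes.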

Comments:	CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.08832 [cs.CV]
	(or arXiv:2306.08832v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.08832

Submission history

From: Le Zhang
[v1] Thu, 15 Jun 2023 03:26:28 UTC (10,447 KB)
[v2] Sun, 2 Jul 2023 00:31:36 UTC (10,449 KB)
[v3] Thu, 28 Dec 2023 15:44:04 UTC (16,623 KB)
[v4] Thu, 25 Apr 2024 15:24:11 UTC (14,531 KB)
