Enhance Vision-Language Alignment with Noise

Huang, Sida; Zhang, Hongyuan; Li, Xuelong

arXiv:2412.10817 (cs)

[Submitted on 14 Dec 2024 (v1), last revised 17 Dec 2024 (this version, v2)]

Abstract: With the advancement of pre-trained vision-language (VL) models, enhancing the alignment between visual and linguistic modalities in downstream tasks has emerged as a critical challenge. Unlike existing fine-tuning methods that add extra modules to these two modalities, we investigate whether the frozen model can be fine-tuned by customized noise. Our approach is motivated by the scientific study of beneficial noise, namely Positive-incentive Noise (Pi-noise or $\pi$-noise), which quantitatively analyzes the impact of noise. It therefore implies a new scheme for learning a beneficial noise distribution that can be employed to fine-tune VL models. Focusing on few-shot classification tasks based on CLIP, we reformulate the inference process of CLIP and apply variational inference, demonstrating how to generate $\pi$-noise for the visual and linguistic modalities. Then, we propose the Positive-incentive Noise Injector (PiNI), which fine-tunes CLIP by injecting noise into both the visual and text encoders. Since the proposed method can learn the distribution of beneficial noise, we can obtain more diverse embeddings of vision and language to better align these two modalities for specific downstream tasks within limited computational resources. We evaluate different noise incorporation approaches and network architectures of PiNI. The evaluation across 11 datasets demonstrates its effectiveness.
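
As a rough illustration of the idea in the abstract, the sketch below adds reparameterized Gaussian noise to frozen image and text embeddings before computing CLIP-style cosine-similarity logits. It is a minimal sketch under stated assumptions, not the PiNI implementation: the paper injects noise into the encoders, whereas this toy adds it to output embeddings; the function names (`sample_additive_noise`, `noisy_clip_logits`), the random stand-in embeddings, the fixed noise parameters, and the temperature value are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def sample_additive_noise(mu, log_sigma, rng):
    """Reparameterized Gaussian sample: mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps


def noisy_clip_logits(image_emb, text_emb, img_noise, txt_noise, temperature=0.01):
    """Cosine-similarity logits between noise-perturbed image and text embeddings."""
    img = l2_normalize(image_emb + img_noise)
    txt = l2_normalize(text_emb + txt_noise)
    return img @ txt.T / temperature


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)

# Random stand-ins for frozen encoder outputs (assumption: 4 images, 3 class prompts, dim 8).
image_emb = rng.standard_normal((4, 8))
text_emb = rng.standard_normal((3, 8))

# Hypothetical noise parameters. In a learned setting these would be produced
# by a small trainable network while the encoders stay frozen.
mu_img, log_sigma_img = np.zeros((4, 8)), np.full((4, 8), -2.0)
mu_txt, log_sigma_txt = np.zeros((3, 8)), np.full((3, 8), -2.0)

img_noise = sample_additive_noise(mu_img, log_sigma_img, rng)
txt_noise = sample_additive_noise(mu_txt, log_sigma_txt, rng)

logits = noisy_clip_logits(image_emb, text_emb, img_noise, txt_noise)
probs = softmax(logits)  # per-image distribution over the 3 classes
```

Because the sample is written as `mu + sigma * eps`, gradients of a classification loss could flow to `mu` and `log_sigma` in an autodiff framework while the encoders themselves remain untouched, which is the sense in which noise alone could act as the tunable component.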

Comments:	Accepted by AAAI 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.10817 [cs.CV]
	(or arXiv:2412.10817v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.10817

Submission history

From: Sida Huang
[v1] Sat, 14 Dec 2024 12:58:15 UTC (2,030 KB)
[v2] Tue, 17 Dec 2024 02:35:10 UTC (2,030 KB)
