AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

Zhu, Yuhan; Ji, Yuyang; Zhao, Zhiyu; Wu, Gangshan; Wang, Limin

arXiv:2407.04603 (cs)

[Submitted on 5 Jul 2024]

Abstract: Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks. However, we often fail to fully unleash their potential when adapting them for new concept understanding due to limited information on new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions through image transformations and language models; dynamically weighting inputs based on the prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and facilitating few-shot learning through an integrated multimodal adapter module. We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales.
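The weighting and transport steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that features for the augmented image views and for the class descriptions have already been extracted and L2-normalized. The function names, the `scale` and `temperature` values, the cosine cost, the Sinkhorn solver, and the uniform weights on the description side are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def entropy_weights(logits, temperature=1.0):
    """Map per-element class logits (n, k) to weights (n,) that sum to 1.

    Elements with lower prediction entropy (more confident) get more weight.
    The softmax over negative entropy is an assumed weighting rule.
    """
    p = softmax(logits, axis=-1)
    h = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return softmax(-h / temperature, axis=0)


def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized transport plan between marginals a and b."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]


def ot_distance(views, descs, a, b, reg=0.1):
    """Transport cost between a set of views and a set of descriptions."""
    cost = 1.0 - views @ descs.T  # cosine distance for unit-norm features
    plan = sinkhorn(cost, a, b, reg)
    return float((plan * cost).sum())


def classify(views, class_descs, scale=10.0):
    """Pick the class whose description set is closest to the view set.

    views:       (n_views, d) unit-norm features of augmented image views
    class_descs: (n_classes, n_desc, d) unit-norm description features
    """
    class_means = class_descs.mean(axis=1)
    class_means = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    # Image side: weight each view by the confidence of its own prediction.
    a = entropy_weights(scale * views @ class_means.T)
    # Text side: uniform weights, a simplification made for this sketch.
    n_desc = class_descs.shape[1]
    b = np.full(n_desc, 1.0 / n_desc)
    dists = [ot_distance(views, descs, a, b) for descs in class_descs]
    return int(np.argmin(dists)), a
```

Calling `classify(views, class_descs)` returns the predicted class index together with the per-view weights. Using a set-to-set transport cost instead of a single averaged similarity lets each view match the descriptions it agrees with most, which is one reading of what the abstract calls mining semantic correlations.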

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.04603 [cs.CV]
	(or arXiv:2407.04603v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.04603

Submission history

From: Yuhan Zhu
[v1] Fri, 5 Jul 2024 15:52:23 UTC (14,370 KB)
