Multi-Modal Mixup for Robust Fine-tuning

So, Junhyuk; Oh, Changdae; Shin, Minchul; Song, Kyungwoo

arXiv:2203.03897v1 (cs)

[Submitted on 8 Mar 2022 (this version), latest version 7 Nov 2023 (v4)]

Abstract: Pre-trained large-scale models provide transferable embeddings and show comparable performance on diverse downstream tasks. However, the transferability of multi-modal learning is limited, and the learned embeddings have not been analyzed thoroughly. This paper offers a perspective for understanding multi-modal embeddings in terms of uniformity and alignment. We find that the representation learned by multi-modal models such as CLIP consists of two separate embedding spaces, one for each heterogeneous dataset, that are poorly aligned. In addition, large intermediate regions between the two modalities remain unexplored, which lowers uniformity. Such a less robust embedding may restrict the transferability of the representation to downstream tasks. We propose a new end-to-end fine-tuning method for robust representations that encourages better uniformity and alignment scores. First, we propose a multi-modal Mixup, $m^{2}$-Mix, which mixes image and text representations to generate hard negative samples. Second, we fine-tune the multi-modal model on these hard negative samples, together with the normal negative and positive samples, using contrastive learning. Our multi-modal Mixup yields a robust representation, and we validate our method on classification, retrieval, and structure-awareness tasks.
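
The ideas named in the abstract can be illustrated with a small, self-contained sketch. The abstract does not give formulas, so everything below is an assumption made for illustration: alignment is taken as the mean squared distance between paired embeddings, uniformity as the log of the mean pairwise Gaussian potential, the mixing rule as linear interpolation followed by re-normalization to the unit sphere, and the hard negatives as mixtures of the other pairs in the batch. All function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(img, txt):
    """Assumed metric: mean squared distance of paired embeddings (lower is better)."""
    return sum(sq_dist(i, t) for i, t in zip(img, txt)) / len(img)

def uniformity(points, t=2.0):
    """Assumed metric: log mean Gaussian potential over distinct pairs (lower is better)."""
    pairs = [(a, b) for k, a in enumerate(points) for b in points[k + 1:]]
    return math.log(sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs) / len(pairs))

def mix(img_vec, txt_vec, lam):
    """Assumed mixing rule: interpolate, then project back onto the unit sphere."""
    return normalize([lam * i + (1 - lam) * t for i, t in zip(img_vec, txt_vec)])

def contrastive_loss(img, txt, lam=0.5, temperature=0.1, use_mixed=True):
    """Image-to-text contrastive loss; mixed embeddings optionally act as extra negatives."""
    n = len(img)
    total = 0.0
    for i in range(n):
        pos = math.exp(dot(img[i], txt[i]) / temperature)
        denom = pos
        for j in range(n):
            if j == i:
                continue
            # Normal negative: the text embedding of another pair.
            denom += math.exp(dot(img[i], txt[j]) / temperature)
            if use_mixed:
                # Assumed hard negative: image/text mixture of another pair.
                hard = mix(img[j], txt[j], lam)
                denom += math.exp(dot(img[i], hard) / temperature)
        total += -math.log(pos / denom)
    return total / n

# Toy embeddings: images cluster near one axis, texts near another,
# mimicking two separated modality regions.
img = [normalize(v) for v in [[1.0, 0.1, 0.0], [0.9, 0.3, 0.1], [0.8, 0.0, 0.3]]]
txt = [normalize(v) for v in [[0.0, 1.0, 0.1], [0.1, 0.9, 0.3], [0.3, 0.8, 0.0]]]

print("alignment:", alignment(img, txt))
print("uniformity:", uniformity(img + txt))
print("loss without mixed negatives:", contrastive_loss(img, txt, use_mixed=False))
print("loss with mixed negatives:", contrastive_loss(img, txt, use_mixed=True))
```

In this sketch the mixed embeddings lie between the image and text clusters, so they sit closer to each anchor than the ordinary negatives do. Adding them to the denominator can only increase the loss, which is one way to read "hard negative" here; the paper's actual objective may differ.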

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:2203.03897 [cs.CV]
	(or arXiv:2203.03897v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2203.03897

Submission history

From: Kyungwoo Song
[v1] Tue, 8 Mar 2022 07:34:52 UTC (6,058 KB)
[v2] Wed, 19 Oct 2022 07:42:56 UTC (11,020 KB)
[v3] Sun, 29 Oct 2023 00:01:40 UTC (6,675 KB)
[v4] Tue, 7 Nov 2023 00:34:37 UTC (6,994 KB)
