Frequency Is What You Need: Word-frequency Masking Benefits Vision-Language Model Pre-training

Liang, Mingliang; Larson, Martha


arXiv:2412.16148 (cs)

[Submitted on 20 Dec 2024]



Abstract: Vision Language Models (VLMs) can be trained more efficiently if training sets can be reduced in size. Recent work has shown the benefits of masking text during VLM training using a variety of approaches: truncation, random masking, block masking and syntax masking. In this paper, we show that the best masking strategy changes over training epochs and that, given sufficient training epochs, word frequency information is what you need to achieve the best performance. Experiments on a large range of data sets demonstrate the advantages of our approach, called Contrastive Language-Image Pre-training with word Frequency Masking (CLIPF). The benefits are particularly evident as the number of input tokens decreases. We analyze the impact of CLIPF vs. other masking approaches on word frequency balance and discuss the apparently critical contribution of CLIPF in maintaining word frequency balance across POS categories.
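The abstract does not state how the masking probability is computed, so the following is only a minimal sketch of what word-frequency masking of a caption might look like. It assumes a word2vec-style square-root weighting (frequent words receive a lower weight for being kept) and a fixed token budget per caption; the function names, the `threshold` parameter, and the sampling scheme are all hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def word_frequencies(captions):
    """Relative frequency of each word across a list of tokenized captions."""
    counts = Counter(word for caption in captions for word in caption)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def keep_weight(freq, threshold=1e-3):
    """Assumed weighting (word2vec-style): frequent words get a smaller weight."""
    if freq is None or freq <= 0:
        return 1.0  # unseen words are treated as rare
    return min(1.0, math.sqrt(threshold / freq))


def frequency_mask(caption, freqs, num_keep, rng, threshold=1e-3):
    """Keep `num_keep` words, favouring rarer ones, and preserve word order."""
    if len(caption) <= num_keep:
        return list(caption)
    # Weighted sampling without replacement: each word draws a random key
    # u ** (1 / weight), and the words with the largest keys are kept.
    keyed = []
    for index, word in enumerate(caption):
        weight = keep_weight(freqs.get(word), threshold)
        keyed.append((rng.random() ** (1.0 / weight), index))
    kept = sorted(index for _, index in sorted(keyed, reverse=True)[:num_keep])
    return [caption[i] for i in kept]


if __name__ == "__main__":
    corpus = [
        ["a", "zebra", "on", "the", "grass"],
        ["a", "dog", "on", "the", "beach"],
        ["a", "cat", "on", "the", "sofa"],
    ]
    freqs = word_frequencies(corpus)
    print(frequency_mask(corpus[0], freqs, num_keep=3, rng=random.Random(0)))
```

Under these assumptions, function words such as "a" or "the" are dropped more often than content words, which is one way a shorter caption could stay informative as the token budget shrinks.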

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.16148 [cs.CV]
	(or arXiv:2412.16148v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.16148

Submission history

From: Mingliang Liang
[v1] Fri, 20 Dec 2024 18:51:41 UTC (15,628 KB)
