The Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning

Mo, Shentong


arXiv:2412.17566 (cs)

[Submitted on 23 Dec 2024]

Title: The Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning

Authors: Shentong Mo


Abstract: Masked autoencoders (MAE) have recently proven successful in self-supervised visual representation learning. Previous work mainly applied hand-designed masking (e.g., random or block-wise) or teacher-guided masking and targets (e.g., from CLIP). However, these approaches ignore the potential role of the self-trained student model in giving feedback to the teacher for masking and targets. In this work, we propose integrating Collaborative Masking and Targets to boost Masked AutoEncoders, a framework we call CMT-MAE. Specifically, CMT-MAE uses a simple collaborative masking mechanism that linearly aggregates attention from both the teacher and student models. We further propose using the output features of the two models as the collaborative target of the decoder. Pre-trained on ImageNet-1K, our simple and effective framework achieves state-of-the-art linear probing and fine-tuning performance. In particular, using ViT-base, we improve the fine-tuning result of the vanilla MAE from 83.6% to 85.7%.
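
The two mechanisms named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the paper's implementation: the abstract states only that attentions are aggregated linearly and that both models' output features form the target. The mixing weights `alpha` and `beta`, the 75% mask ratio, the rule of masking the highest-scoring patches, and all function names are assumptions made for illustration.

```python
import random


def collaborative_scores(teacher_attn, student_attn, alpha=0.5):
    """Linearly aggregate per-patch attention from teacher and student.

    alpha is a hypothetical mixing weight; the abstract does not give one.
    """
    if len(teacher_attn) != len(student_attn):
        raise ValueError("attention vectors must cover the same patches")
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_attn, student_attn)]


def select_mask(scores, mask_ratio=0.75):
    """Return the sorted indices of the patches to mask.

    Assumption: the highest-scoring patches are masked. The abstract does
    not say how scores map to a mask, so this rule is illustrative only.
    """
    num_masked = int(len(scores) * mask_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_masked])


def collaborative_target(teacher_feat, student_feat, beta=0.5):
    """Blend per-patch teacher and student features into one decoder target.

    beta is hypothetical; a weighted sum is only one way to combine them.
    """
    return [[beta * t + (1.0 - beta) * s for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(teacher_feat, student_feat)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy data is reproducible
    num_patches, dim = 8, 4
    teacher_attn = [rng.random() for _ in range(num_patches)]
    student_attn = [rng.random() for _ in range(num_patches)]
    teacher_feat = [[rng.random() for _ in range(dim)] for _ in range(num_patches)]
    student_feat = [[rng.random() for _ in range(dim)] for _ in range(num_patches)]

    scores = collaborative_scores(teacher_attn, student_attn)
    masked = select_mask(scores)
    target = collaborative_target(teacher_feat, student_feat)
    print("masked patch indices:", masked)
    print("target rows for masked patches:", [target[i] for i in masked])
```

In an actual training loop the attention scores and features would come from the teacher and student networks, and the student's contribution would change as it trains. The toy random values here only stand in for those quantities.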

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Cite as: arXiv:2412.17566 [cs.CV] (or arXiv:2412.17566v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2412.17566

Submission history

From: Shentong Mo
[v1] Mon, 23 Dec 2024 13:37:26 UTC (1,281 KB)
