Context Autoencoder for Self-Supervised Representation Learning

Chen, Xiaokang; Ding, Mingyu; Wang, Xiaodi; Xin, Ying; Mo, Shentong; Wang, Yunhao; Han, Shumin; Luo, Ping; Zeng, Gang; Wang, Jingdong

arXiv:2202.03026v2 (cs)

[Submitted on 7 Feb 2022 (v1), revised 30 May 2022 (this version, v2), latest version 10 Aug 2023 (v3)]

Abstract: We present a novel masked image modeling (MIM) approach, context autoencoder (CAE), for self-supervised representation pretraining. The goal is to pretrain an encoder by solving a pretext task: estimating the masked patches from the visible patches in an image. Our approach first feeds the visible patches into the encoder to extract their representations. Then, we make predictions from visible patches to masked patches in the encoded representation space. We introduce an alignment constraint that encourages the representations for masked patches, predicted from the encoded representations of visible patches, to align with the masked patch representations computed by the encoder. In other words, the predicted representations are expected to lie in the encoded representation space, which we find empirically benefits representation learning. Finally, the predicted masked patch representations are mapped to the targets of the pretext task through a decoder. In comparison to previous MIM methods (e.g., BEiT) that couple the encoding and pretext task completion roles, our approach benefits from separating the representation learning (encoding) role from the pretext task completion role, which improves the representation learning capacity and accordingly helps more on downstream tasks. In addition, we explain why contrastive pretraining and supervised pretraining perform similarly and why MIM potentially performs better. We demonstrate the effectiveness of CAE through superior transfer performance on downstream tasks: semantic segmentation, object detection, and instance segmentation.
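
The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the random linear maps stand in for the encoder, regressor, and decoder networks, the mean-pooled context is a simplifying assumption about how visible representations inform the prediction, and all names (`cae_losses`, `enc_w`, `reg_w`, `dec_w`) are hypothetical. It only shows where the two losses, alignment and pretext reconstruction, would attach.

```python
import random

def linear(dim_in, dim_out, rng):
    """Random weight matrix standing in for a learned network (hypothetical)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim_out)] for _ in range(dim_in)]

def apply(weights, rows):
    """Multiply each row vector by the (dim_in x dim_out) weight matrix."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
            for row in rows]

def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def cae_losses(patches, masked_idx, targets, enc_w, reg_w, dec_w):
    """Return (alignment_loss, reconstruction_loss) for one image."""
    masked = set(masked_idx)
    visible = [p for i, p in enumerate(patches) if i not in masked]
    hidden = [patches[i] for i in masked_idx]

    # 1. The encoder sees only the visible patches.
    z_visible = apply(enc_w, visible)

    # 2. Predict masked-patch representations from the visible ones.
    #    Assumption: mean-pool the visible representations, then a linear map.
    context = [sum(col) / len(z_visible) for col in zip(*z_visible)]
    z_pred = apply(reg_w, [context for _ in hidden])

    # 3. Alignment constraint: predictions should match what the encoder
    #    itself produces for the masked patches.
    z_target = apply(enc_w, hidden)
    align = mse(z_pred, z_target)

    # 4. The decoder maps predicted representations to the pretext targets.
    recon = mse(apply(dec_w, z_pred), [targets[i] for i in masked_idx])
    return align, recon

# Toy usage: 6 patches of dimension 4, 2 of them masked, targets of dimension 3.
rng = random.Random(0)
patches = [[rng.random() for _ in range(4)] for _ in range(6)]
targets = [[rng.random() for _ in range(3)] for _ in range(6)]
enc_w, reg_w, dec_w = linear(4, 5, rng), linear(5, 5, rng), linear(5, 3, rng)
align, recon = cae_losses(patches, [1, 4], targets, enc_w, reg_w, dec_w)
print(f"alignment={align:.4f} reconstruction={recon:.4f}")
```

In this sketch the encoder never receives a reconstruction target directly; only the regressor and decoder stand-ins sit between its output and the pretext targets, which mirrors the separation of roles the abstract describes.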

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2202.03026 [cs.CV]
	(or arXiv:2202.03026v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2202.03026

Submission history

From: Xiaokang Chen
[v1] Mon, 7 Feb 2022 09:33:45 UTC (22,831 KB)
[v2] Mon, 30 May 2022 08:42:10 UTC (23,013 KB)
[v3] Thu, 10 Aug 2023 11:01:14 UTC (10,932 KB)
