PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling

Lee, Junmyeong; Hwang, Eui Jun; Cho, Sukmin; Park, Jong C.

arXiv:2501.03005 (cs)

[Submitted on 6 Jan 2025]

Abstract: In Masked Image Modeling (MIM), two primary methods exist: Pixel MIM and Latent MIM, which use different reconstruction targets, raw pixels and latent representations, respectively. Pixel MIM tends to capture low-level visual details such as color and texture, while Latent MIM focuses on the high-level semantics of an object. However, because each method's strength is confined to one level of visual features, it can perform suboptimally on tasks that rely on the other level. To address this limitation, we propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths. Our method uses a single encoder along with two distinct decoders, one predicting pixel values and the other predicting latent representations, so that both high-level and low-level visual features are captured. We further integrate the CLS token into the reconstruction process to aggregate global context, enabling the model to capture more semantic information. Extensive experiments show that PiLaMIM outperforms key baselines such as MAE, I-JEPA and BootMAE in most cases, indicating its effectiveness in extracting richer visual representations.
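
The abstract describes a single encoder feeding two decoders, one reconstructing pixels and one reconstructing latent representations. The toy sketch below illustrates only how two such reconstruction terms might be combined over masked patches. It is not the authors' implementation: the abstract does not state the loss, so the mean-squared-error form, the `latent_weight` factor, the 0.75 mask ratio, and all function names here are assumptions made for illustration, and the CLS token is omitted.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Split patch indices into (visible, masked) lists at random."""
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    return sorted(indices[num_masked:]), sorted(indices[:num_masked])


def mse(pred, target):
    """Mean squared error over two equally shaped lists of vectors."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(pred, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def combined_loss(pixel_pred, pixel_target, latent_pred, latent_target,
                  masked, latent_weight=1.0):
    """Sum a pixel term and a latent term, both restricted to masked patches.

    Assumption: both terms are plain MSE and are mixed with a scalar weight.
    The paper may use a different formulation.
    """
    pixel_term = mse([pixel_pred[i] for i in masked],
                     [pixel_target[i] for i in masked])
    latent_term = mse([latent_pred[i] for i in masked],
                      [latent_target[i] for i in masked])
    return pixel_term + latent_weight * latent_term, pixel_term, latent_term


if __name__ == "__main__":
    rng = random.Random(0)
    num_patches, pixel_dim, latent_dim = 16, 4, 3

    # Toy stand-ins for ground-truth pixels and target-network latents.
    pixels = [[rng.random() for _ in range(pixel_dim)] for _ in range(num_patches)]
    latents = [[rng.random() for _ in range(latent_dim)] for _ in range(num_patches)]

    # Toy stand-ins for the outputs of the two decoders.
    pixel_out = [[rng.random() for _ in range(pixel_dim)] for _ in range(num_patches)]
    latent_out = [[rng.random() for _ in range(latent_dim)] for _ in range(num_patches)]

    visible, masked = random_mask(num_patches, 0.75, rng)
    total, pixel_term, latent_term = combined_loss(
        pixel_out, pixels, latent_out, latents, masked)
    print(f"masked={len(masked)} pixel={pixel_term:.4f} "
          f"latent={latent_term:.4f} total={total:.4f}")
```

In a real model the two prediction lists would come from decoders sharing one encoder's output, so gradients from both terms would update the same encoder. That shared encoder is the part of the design the abstract credits with capturing both feature levels.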

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.03005 [cs.CV]
	(or arXiv:2501.03005v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.03005

Submission history

From: Junmyeong Lee [view email]
[v1] Mon, 6 Jan 2025 13:30:16 UTC (831 KB)
