Motion-Guided Masking for Spatiotemporal Representation Learning

Fan, David; Wang, Jue; Liao, Shuai; Zhu, Yi; Bhat, Vimal; Santos-Villalobos, Hector; MV, Rohith; Li, Xinyu

arXiv:2308.12962 (cs)

[Submitted on 24 Aug 2023]

Abstract: Several recent works have directly extended the image masked autoencoder (MAE) with random masking to the video domain, achieving promising results. However, unlike for images, both spatial and temporal information are important for video understanding. This suggests that the random masking strategy inherited from the image MAE is less effective for video MAE, and motivates the design of a novel masking algorithm that makes more efficient use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) that leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be obtained directly from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve up to a $+1.3\%$ improvement over previous state-of-the-art methods. Additionally, our MGM matches the performance of previous video MAE using up to $66\%$ fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving up to a $+4.9\%$ improvement over baseline methods.
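
The idea of letting motion vectors guide the position of a mask over time can be illustrated with a minimal sketch. It is not the paper's implementation, and everything specific in it is an assumption made for illustration: the function name `motion_guided_mask`, a single rectangular mask, a 14x14 patch grid, and one (dy, dx) shift per frame transition, standing in for whatever summary of the codec motion vectors a real pipeline would compute.

```python
import numpy as np


def motion_guided_mask(displacements, grid_hw=(14, 14), box_hw=(7, 7), start_yx=(3, 3)):
    """Return a boolean array of shape (T, H, W); True marks a masked patch.

    displacements: sequence of T - 1 (dy, dx) shifts in patch units, one per
    frame transition. This is a hypothetical input; in practice it would be
    derived from the motion vectors stored in the compressed video.
    """
    displacements = np.asarray(displacements, dtype=float).reshape(-1, 2)
    num_frames = displacements.shape[0] + 1
    grid_h, grid_w = grid_hw
    box_h, box_w = box_hw

    mask = np.zeros((num_frames, grid_h, grid_w), dtype=bool)
    y, x = float(start_yx[0]), float(start_yx[1])

    for t in range(num_frames):
        if t > 0:
            # Move the box by the motion observed between frames t-1 and t.
            dy, dx = displacements[t - 1]
            y, x = y + dy, x + dx
        # Clamp so the box always stays fully inside the patch grid.
        y = min(max(y, 0.0), float(grid_h - box_h))
        x = min(max(x, 0.0), float(grid_w - box_w))
        top, left = int(round(y)), int(round(x))
        mask[t, top:top + box_h, left:left + box_w] = True

    return mask


# Toy example: the masked region drifts right twice, then down once.
shifts = [(0, 1), (0, 1), (1, 0)]
mask = motion_guided_mask(shifts)
print(mask.shape)   # (4, 14, 14)
print(mask.mean())  # 0.25 (a 7x7 box covers 49 of 196 patches in every frame)
```

Because the box is clamped to the grid, the masking ratio stays constant across frames, whereas independent random masking would place masked patches without regard to where the moving content goes.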

Comments:	Accepted to ICCV 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2308.12962 [cs.CV]
	(or arXiv:2308.12962v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.12962

Submission history

From: David Fan
[v1] Thu, 24 Aug 2023 17:58:04 UTC (595 KB)
