An Empirical Study of Autoregressive Pre-training from Videos

Rajasegaran, Jathushan; Radosavovic, Ilija; Ravishankar, Rahul; Gandelsman, Yossi; Feichtenhofer, Christoph; Malik, Jitendra

arXiv:2501.05453 (cs)

[Submitted on 9 Jan 2025]

Title: An Empirical Study of Autoregressive Pre-training from Videos

Authors: Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik

Abstract: We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at this https URL
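
As a rough illustration of the next-token objective the abstract describes, the sketch below treats a video as a flat list of discrete token ids and averages the cross-entropy of predicting each token from the tokens before it. It is not the Toto implementation: the token ids, the vocabulary size of 8, and the names `next_token_pairs`, `autoregressive_loss`, and `uniform_model` are hypothetical, and a stand-in function replaces the transformer.

```python
import math


def next_token_pairs(tokens):
    """Split a token sequence into (context, target) pairs: predict tokens[i] from tokens[:i]."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def autoregressive_loss(tokens, predict_logits):
    """Average next-token cross-entropy over every position after the first."""
    pairs = next_token_pairs(tokens)
    total = sum(cross_entropy(predict_logits(ctx), tgt) for ctx, tgt in pairs)
    return total / len(pairs)


def uniform_model(context, vocab_size=8):
    """Hypothetical stand-in for a transformer: equal logits for every token id."""
    return [0.0] * vocab_size


# Hypothetical visual token ids for a short clip (assumed vocabulary of 8 ids).
video_tokens = [3, 1, 4, 1, 5]
loss = autoregressive_loss(video_tokens, uniform_model)
```

With the uniform stand-in the loss equals `math.log(8)`, the value an uninformed predictor gets over 8 token ids; a trained model would be expected to score lower.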

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.05453 [cs.CV]
	(or arXiv:2501.05453v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.05453

Submission history

From: Jathushan Rajasegaran
[v1] Thu, 9 Jan 2025 18:59:58 UTC (31,625 KB)
