Theia: Distilling Diverse Vision Foundation Models for Robot Learning

Shang, Jinghuan; Schmeckpeper, Karl; May, Brandon B.; Minniti, Maria Vittoria; Kelestemur, Tarik; Watkins, David; Herlant, Laura

arXiv:2407.20179 (cs)

[Submitted on 29 Jul 2024 (v1), last revised 10 Oct 2024 (this version, v2)]

Abstract: Vision-based robot policy learning, which maps visual inputs to actions, necessitates a holistic understanding of diverse visual tasks beyond single-task needs like classification or segmentation. Inspired by this, we introduce Theia, a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks. Theia's rich visual representations encode diverse visual knowledge, enhancing downstream robot learning. Extensive experiments demonstrate that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes. Additionally, we quantify the quality of pre-trained visual representations and hypothesize that higher entropy in feature norm distributions leads to improved robot learning performance. Code, models, and demo are available at this https URL.
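The abstract does not say how the entropy of a feature norm distribution is computed. As an illustrative sketch only, and not the authors' procedure, one plausible reading is to take the L2 norm of each token's feature vector, normalize the norms so they sum to one, and compute the Shannon entropy of the result. The function name `feature_norm_entropy` and the toy inputs below are hypothetical.

```python
import math
from typing import Sequence


def feature_norm_entropy(tokens: Sequence[Sequence[float]]) -> float:
    """Shannon entropy (in nats) of the normalized L2-norm distribution.

    ASSUMPTION: this is one possible interpretation of "entropy in feature
    norm distributions"; the abstract does not define the exact metric.
    """
    # L2 norm of each token's feature vector.
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    total = sum(norms)
    if total == 0.0:
        raise ValueError("all feature norms are zero")
    # Normalize the norms into a probability distribution over tokens.
    probs = [n / total for n in norms]
    # Zero-probability terms contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy inputs (hypothetical): four tokens with two-dimensional features.
uniform = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # equal norms
peaked = [[10.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.1]]  # one dominant norm

entropy_uniform = feature_norm_entropy(uniform)
entropy_peaked = feature_norm_entropy(peaked)
```

Under this reading, tokens with evenly spread norms give the maximum entropy for their count, while a few tokens with outlier norms pull the entropy down.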

Comments:	CoRL 2024
Subjects:	Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2407.20179 [cs.RO]
	(or arXiv:2407.20179v2 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2407.20179

Submission history

From: Jinghuan Shang
[v1] Mon, 29 Jul 2024 17:08:21 UTC (16,424 KB)
[v2] Thu, 10 Oct 2024 17:27:46 UTC (20,020 KB)
