Towards Generalist Robot Learning from Internet Video: A Survey

McCarthy, Robert; Tan, Daniel C. H.; Schmidt, Dominik; Acero, Fernando; Herr, Nathan; Du, Yilun; Thuruthel, Thomas G.; Li, Zhibin

arXiv:2404.19664v2 (cs)

[Submitted on 30 Apr 2024 (v1), revised 7 Jun 2024 (this version, v2), latest version 12 Nov 2024 (v4)]

Abstract: This survey presents an overview of methods for learning from video (LfV) in the context of reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large internet video datasets and, in the process, extracting foundational knowledge about the world's dynamics and physical human behaviour. Such methods hold great promise for developing general-purpose robots.
We open with an overview of fundamental concepts relevant to the LfV-for-robotics setting. This includes a discussion of the exciting benefits LfV methods can offer (e.g., improved generalization beyond the available robot data) and commentary on key LfV challenges (e.g., missing information in video and LfV distribution shifts). Our literature review begins with an analysis of video foundation model techniques that can extract knowledge from large, heterogeneous video datasets. Next, we review methods that specifically leverage video data for robot learning. Here, we categorise work according to which RL knowledge modality (KM) benefits from the use of video data. We additionally highlight techniques for mitigating LfV challenges, including reviewing action representations that address missing action labels in video.
Finally, we examine LfV datasets and benchmarks, before concluding with a discussion of challenges and opportunities in LfV. Here, we advocate for scalable foundation model approaches that can leverage the full range of internet video data, and that target the learning of the most promising RL KMs: the policy and dynamics model. Overall, we hope this survey will serve as a comprehensive reference for the emerging field of LfV, catalysing further research in the area and facilitating progress towards the development of general-purpose robots.

Comments:	Updated formatting. Reduced paper length and made other minor improvements
Subjects:	Robotics (cs.RO); Machine Learning (cs.LG)
Cite as:	arXiv:2404.19664 [cs.RO]
	(or arXiv:2404.19664v2 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2404.19664

Submission history

From: Robert McCarthy
[v1] Tue, 30 Apr 2024 15:57:41 UTC (3,889 KB)
[v2] Fri, 7 Jun 2024 09:25:42 UTC (913 KB)
[v3] Mon, 14 Oct 2024 17:41:06 UTC (1,637 KB)
[v4] Tue, 12 Nov 2024 12:43:42 UTC (1,644 KB)
