AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding

Suglia, Alessandro; Greco, Claudio; Baker, Katie; Part, Jose L.; Papaioannou, Ioannis; Eshghi, Arash; Konstas, Ioannis; Lemon, Oliver

arXiv:2406.13807 (cs)

[Submitted on 19 Jun 2024 (v1), last revised 21 Jun 2024 (this version, v2)]

Abstract: AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we make three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI.

Comments:	Code available this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2406.13807 [cs.CV]
	(or arXiv:2406.13807v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.13807

Submission history

From: Alessandro Suglia
[v1] Wed, 19 Jun 2024 20:14:14 UTC (27,906 KB)
[v2] Fri, 21 Jun 2024 09:53:41 UTC (27,906 KB)
