PreMind: Multi-Agent Video Understanding for Advanced Indexing of Presentation-style Videos

Wei, Kangda; Zhou, Zhengyu; Wang, Bingqing; Araki, Jun; Lange, Lukas; Huang, Ruihong; Feng, Zhe

arXiv:2503.00162 (cs)

[Submitted on 28 Feb 2025]

Abstract: In recent years, online lecture videos have become an increasingly popular resource for acquiring new knowledge. Systems capable of effectively understanding/indexing lecture videos are thus highly desirable, enabling downstream tasks like question answering to help users efficiently locate specific information within videos. This work proposes PreMind, a novel multi-agent multimodal framework that leverages various large models for advanced understanding/indexing of presentation-style videos. PreMind first segments videos into slide-presentation segments using a Vision-Language Model (VLM) to enhance modern shot-detection techniques. Each segment is then analyzed to generate multimodal indexes through three key steps: (1) extracting slide visual content, (2) transcribing speech narratives, and (3) consolidating these visual and speech contents into an integrated understanding. Three innovative mechanisms are also proposed to improve performance: leveraging prior lecture knowledge to refine visual understanding, detecting/correcting speech transcription errors using a VLM, and utilizing a critic agent for dynamic iterative self-reflection in vision analysis. Compared to traditional video indexing methods, PreMind captures rich, reliable multimodal information, allowing users to search for details like abbreviations shown only on slides. Systematic evaluations on the public LPM dataset and an internal enterprise dataset are conducted to validate PreMind's effectiveness, supported by detailed analyses.
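The per-segment flow described in the abstract (visual extraction with critic feedback, transcription with correction, then consolidation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`SegmentIndex`, `analyze_with_critic`, `index_video`, and the toy stand-in functions) is hypothetical, the models are replaced by plain callables, and the video is assumed to be already segmented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SegmentIndex:
    """Hypothetical multimodal index entry for one slide segment."""
    start: float
    end: float
    visual: str
    speech: str
    summary: str


def analyze_with_critic(
    describe: Callable[[str, str, Optional[str]], str],
    critique: Callable[[str, str], Optional[str]],
    frame: str,
    prior: str,
    max_rounds: int = 3,
) -> str:
    """Re-run the vision step until the critic returns no feedback."""
    feedback: Optional[str] = None
    description = ""
    for _ in range(max_rounds):
        description = describe(frame, prior, feedback)
        feedback = critique(frame, description)
        if feedback is None:  # critic is satisfied
            break
    return description


def index_video(
    segments: List[Tuple[float, float, str, str]],
    describe, critique, transcribe, correct, consolidate,
) -> List[SegmentIndex]:
    """Build one index entry per (start, end, frame, audio) segment."""
    prior = ""  # knowledge carried over from earlier segments
    indexes = []
    for start, end, frame, audio in segments:
        visual = analyze_with_critic(describe, critique, frame, prior)
        speech = correct(transcribe(audio), visual)  # fix transcript using slide content
        summary = consolidate(visual, speech)
        indexes.append(SegmentIndex(start, end, visual, speech, summary))
        prior = summary
    return indexes


# Toy stand-ins for the models (assumptions for illustration only).
def toy_describe(frame, prior, feedback):
    return frame.upper() if feedback else frame

def toy_critique(frame, description):
    return None if description.isupper() else "use upper case"

result = index_video(
    [(0.0, 10.0, "intro slide", "a vlm is used")],
    toy_describe,
    toy_critique,
    lambda audio: audio,
    lambda text, visual: text.replace("vlm", "VLM"),
    lambda visual, speech: f"{visual} | {speech}",
)
```

In this sketch the critic loop is bounded by `max_rounds`, so a critic that never approves cannot stall indexing. How the actual system terminates its self-reflection is not stated in the abstract.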

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Cite as:	arXiv:2503.00162 [cs.CV]
	(or arXiv:2503.00162v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.00162

Submission history

From: Kangda Wei
[v1] Fri, 28 Feb 2025 20:17:48 UTC (11,217 KB)
