ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

Rawte, Vipula; Jain, Sarthak; Sinha, Aarush; Kaushik, Garv; Bansal, Aman; Vishwanath, Prathiksha Rumale; Jain, Samyak Rajesh; Reganti, Aishwarya Naresh; Jain, Vinija; Chadha, Aman; Sheth, Amit P.; Das, Amitava

arXiv:2411.10867 (cs)

[Submitted on 16 Nov 2024 (v1), last revised 19 Mar 2025 (this version, v2)]

Abstract: Recent advances in Large Multimodal Models (LMMs) have expanded their capabilities to video understanding, with Text-to-Video (T2V) models excelling in generating videos from textual prompts. However, they still frequently produce hallucinated content, revealing AI-generated inconsistencies. We introduce ViBe (this https URL), a large-scale dataset of hallucinated videos from open-source T2V models. We identify five major hallucination types: Vanishing Subject, Omission Error, Numeric Variability, Subject Dysmorphia, and Visual Incongruity. Using ten T2V models, we generated and manually annotated 3,782 videos from 837 diverse MS COCO captions. Our proposed benchmark includes a dataset of hallucinated videos and a classification framework using video embeddings. ViBe serves as a critical resource for evaluating T2V reliability and advancing hallucination detection. We establish classification as a baseline, with the TimeSFormer + CNN ensemble achieving the best performance (0.345 accuracy, 0.342 F1 score). The modest accuracy of these initial baselines highlights the difficulty of automated hallucination detection and the need for improved methods. Our research aims to drive the development of more robust T2V models and evaluate their outputs based on user preferences.
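The abstract reports accuracy and F1 for a five-way hallucination classifier. As a rough illustration of how such scores could be computed over the five categories, here is a minimal sketch in plain Python. The function names are hypothetical, and macro-averaging the F1 score is an assumption: the abstract does not say which averaging the authors used.

```python
# Minimal sketch: accuracy and macro-averaged F1 for five-way classification.
# The category names come from the abstract; the helper names and the choice
# of macro averaging are assumptions made for illustration only.

CATEGORIES = [
    "Vanishing Subject",
    "Omission Error",
    "Numeric Variability",
    "Subject Dysmorphia",
    "Visual Incongruity",
]


def accuracy(y_true, y_pred):
    """Fraction of videos whose predicted category matches the annotation."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=CATEGORIES):
    """Unweighted mean of per-category F1 scores.

    A category with no true and no predicted instances contributes 0.0.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    truth = ["Vanishing Subject", "Omission Error", "Numeric Variability"]
    preds = ["Vanishing Subject", "Vanishing Subject", "Numeric Variability"]
    print(accuracy(truth, preds), macro_f1(truth, preds))
```

For context, uniform guessing over five equally frequent classes would give an accuracy of 0.2. The abstract does not state whether the ViBe categories are balanced, so that figure is only a loose reference point.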

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2411.10867 [cs.CV]
	(or arXiv:2411.10867v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.10867

Submission history

From: Aarush Sinha
[v1] Sat, 16 Nov 2024 19:23:12 UTC (21,859 KB)
[v2] Wed, 19 Mar 2025 18:53:09 UTC (19,919 KB)
