HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models

Wang, Xiao; Hua, Jingyun; Lin, Weihong; Zhang, Yuanxing; Zhang, Fuzheng; Wu, Jianlong; Zhang, Di; Nie, Liqiang

arXiv:2502.20811 (cs)

[Submitted on 28 Feb 2025]

Abstract: Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, HAICBench includes 500 manually annotated video-caption pairs and 1,400 QA pairs, for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human understanding abilities across 4 benchmarks, but can also improve text-to-video generation results. Both HAICTrain and HAICBench are released at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as:	arXiv:2502.20811 [cs.CV]
	(or arXiv:2502.20811v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.20811

Submission history

From: Xiao Wang
[v1] Fri, 28 Feb 2025 07:53:40 UTC (11,268 KB)
