HPE-CogVLM: New Head Pose Grounding Task Exploration on Vision Language Model

Tian, Yu; Shao, Tianqi; Demizu, Tsukasa; Wu, Xuyang; Wu, Hsin-Tai

Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.01914 (cs)

[Submitted on 4 Jun 2024]

Abstract: The head pose estimation (HPE) task requires a sophisticated understanding of 3D spatial relationships and precise numerical output of yaw, pitch, and roll Euler angles. Previous HPE studies are mainly based on non-large language models (Non-LLMs), which rely on close-up human heads cropped from the full image as inputs and lack robustness in real-world scenarios. In this paper, we present a novel framework that enhances HPE prediction by leveraging the visual grounding capability of CogVLM. CogVLM is a vision language model (VLM) with the grounding capability of predicting object bounding boxes (BBoxes), which enables HPE training and prediction using the full image as input. To integrate the HPE task into the VLM, we first cope with the catastrophic forgetting problem in large language models (LLMs) by investigating the rehearsal ratio in the data rehearsal method. Then, we propose and validate a LoRA layer-based model merging method, which preserves the integrity of parameters, to enhance HPE performance in the framework. The results show that our HPE-CogVLM achieves a 31.5% reduction in Mean Absolute Error for HPE prediction over the current Non-LLM based state-of-the-art in cross-dataset evaluation. Furthermore, we compare our LoRA layer-based model merging method with LoRA fine-tuning only and with other merging methods in CogVLM. The results demonstrate that our framework outperforms them in all HPE metrics.
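The data rehearsal idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_rehearsal_mix` is hypothetical, and the definition of the rehearsal ratio used here (number of replayed old-task samples divided by number of new-task samples) is an assumption, since the abstract does not define it.

```python
import random


def build_rehearsal_mix(new_task, old_task, rehearsal_ratio, seed=0):
    """Mix new-task samples with replayed old-task samples.

    ASSUMPTION: rehearsal_ratio = (replayed old-task samples) / (new-task samples).
    The abstract does not state how the ratio is defined in the paper.
    """
    if rehearsal_ratio < 0:
        raise ValueError("rehearsal_ratio must be non-negative")

    rng = random.Random(seed)  # local RNG so the mix is reproducible

    # Number of old-task samples to replay, capped by what is available.
    n_old = min(len(old_task), round(rehearsal_ratio * len(new_task)))

    # Draw the replayed samples without replacement.
    replay = rng.sample(list(old_task), n_old)

    # Combine and shuffle so both tasks are interleaved during training.
    mixed = list(new_task) + replay
    rng.shuffle(mixed)
    return mixed


# Hypothetical placeholder samples: (task_name, sample_index) pairs.
hpe_samples = [("hpe", i) for i in range(100)]
grounding_samples = [("grounding", i) for i in range(50)]

mixed = build_rehearsal_mix(hpe_samples, grounding_samples, rehearsal_ratio=0.25)
```

With 100 new-task samples and a ratio of 0.25 under this assumed definition, the mix contains 25 replayed grounding samples alongside all 100 HPE samples. Sweeping `rehearsal_ratio` over several values is one way to study the trade-off between retaining the original grounding ability and learning the new HPE task.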

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2406.01914 [cs.CV]
	(or arXiv:2406.01914v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.01914

Submission history

From: Yu Tian
[v1] Tue, 4 Jun 2024 02:51:26 UTC (3,374 KB)
