Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos

Hasan, Md Zahid; Chen, Jiajing; Wang, Jiyang; Rahman, Mohammed Shaiqur; Joshi, Ameya; Velipasalar, Senem; Hegde, Chinmay; Sharma, Anuj; Sarkar, Soumik

arXiv:2306.10159 (cs)

[Submitted on 16 Jun 2023 (v1), last revised 21 Mar 2024 (this version, v4)]

Abstract: Recognizing the activities causing distraction in real-world driving scenarios is critical for ensuring the safety and reliability of both drivers and pedestrians on the roadways. Conventional computer vision techniques are typically data-intensive and require a large volume of annotated training data to detect and classify various distracted driving behaviors, thereby limiting their efficiency and scalability. We aim to develop a generalized framework that showcases robust performance with access to limited or no annotated training data. Recently, vision-language models have offered large-scale visual-textual pretraining that can be adapted to task-specific learning like distracted driving activity recognition. Vision-language pretraining models, such as CLIP, have shown significant promise in learning natural language-guided visual representations. This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos. CLIP's vision embedding offers zero-shot transfer and task-based finetuning, which can classify distracted activities from driving video data. Our results show that this framework offers state-of-the-art performance on zero-shot transfer and video-based CLIP for predicting the driver's state on two public datasets. We propose both frame-based and video-based frameworks developed on top of CLIP's visual representation for distracted driving detection and classification tasks and report the results.
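
To make the zero-shot idea in the abstract concrete, the following is a minimal sketch of how CLIP-style zero-shot classification is commonly set up: an image embedding is compared against text-prompt embeddings by cosine similarity, and the similarities are turned into class probabilities with a softmax. It is not the authors' implementation. The label prompts, the toy 3-dimensional embeddings, the scaling factor of 100, and the choice to average per-frame probabilities to get a video-level prediction are all assumptions made for illustration; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_emb, text_embs, scale=100.0):
    """Class probabilities for one frame from scaled cosine similarities.

    `scale` is an assumed logit scaling factor, not a value from the paper.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(scale * sum(a * b for a, b in zip(img, txt)))
    return softmax(logits)

def classify_video(frame_embs, text_embs, labels):
    """Assumed video-level rule: average the per-frame probabilities."""
    per_frame = [zero_shot_probs(f, text_embs) for f in frame_embs]
    mean_probs = [
        sum(p[i] for p in per_frame) / len(per_frame)
        for i in range(len(labels))
    ]
    best = max(range(len(labels)), key=lambda i: mean_probs[i])
    return labels[best], mean_probs

# Hypothetical class prompts and toy embeddings (stand-ins for encoder outputs).
labels = ["driving safely", "texting on the phone", "drinking"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_embs = [[0.2, 0.9, 0.1], [0.1, 0.8, 0.3], [0.6, 0.5, 0.1]]

predicted_label, video_probs = classify_video(frame_embs, text_embs, labels)
print(predicted_label, [round(p, 3) for p in video_probs])
```

Because no classifier weights are trained here, swapping in a different set of label prompts changes the task without any annotated data, which is the property the abstract refers to as zero-shot transfer.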

Comments:	15 pages, 7 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.10159 [cs.CV]
	(or arXiv:2306.10159v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.10159

Submission history

From: Md Zahid Hasan
[v1] Fri, 16 Jun 2023 20:02:51 UTC (6,560 KB)
[v2] Thu, 22 Jun 2023 23:11:43 UTC (6,637 KB)
[v3] Thu, 4 Jan 2024 20:23:39 UTC (8,735 KB)
[v4] Thu, 21 Mar 2024 04:17:26 UTC (8,689 KB)
