The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA

Zhang, Hailiang; Chao, Dian; Guan, Zhihao; Yang, Yang

Computer Science > Computer Vision and Pattern Recognition

arXiv:2407.01907 (cs)

[Submitted on 2 Jul 2024]

Title:The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA

Authors:Hailiang Zhang, Dian Chao, Zhihao Guan, Yang Yang

View PDF HTML (experimental)

Abstract:In this paper, we introduce a grounded video question-answering solution. Our research reveals that the fixed official baseline method for video question answering involves two main steps: visual grounding and object tracking. However, a significant challenge emerges during the initial step, where selected frames may lack clearly identifiable target objects. Furthermore, single images cannot address questions like "Track the container from which the person pours the first time." To tackle this issue, we propose an alternative two-stage approach:(1) First, we leverage the VALOR model to answer questions based on video information.(2) concatenate the answered questions with their respective answers. Finally, we employ TubeDETR to generate bounding boxes for the targets.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2407.01907 [cs.CV]
	(or arXiv:2407.01907v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.01907

Submission history

From: Yang Yang [view email]
[v1] Tue, 2 Jul 2024 03:13:27 UTC (990 KB)

Full-text links:

Access Paper:

view license

Current browse context:

cs.CV

< prev | next >

new | recent | 2024-07

Change to browse by:

cs
cs.LG

References & Citations

export BibTeX citation

Computer Science > Computer Vision and Pattern Recognition

Title:The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators