Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering

Hao, Dongze; Wang, Qunbo; Guo, Longteng; Jiang, Jie; Liu, Jing

arXiv:2404.13947 (cs)

[Submitted on 22 Apr 2024 (v1), last revised 8 Oct 2024 (this version, v3)]

Abstract: While large visual-language models (LVLMs) have shown promising results on traditional visual question answering benchmarks, it is still challenging for them to answer complex VQA problems that require diverse world knowledge. Motivated by research on retrieval-augmented generation in natural language processing, we use Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions. However, DPR retrieves in natural-language space, which may not capture the image information comprehensively. As a result, the retrieved knowledge is not always conducive to answering the question, which degrades the performance of the overall system. To address this issue, we propose a novel framework that leverages the visual-language model to select the key knowledge retrieved by DPR and answer questions. The framework consists of two modules, a Selector and an Answerer, both of which are initialized from the LVLM and finetuned parameter-efficiently by self-bootstrapping: find key knowledge in the retrieved knowledge documents using the Selector, and then use it to finetune the Answerer to predict answers; obtain pseudo-labels for key knowledge documents based on the predictions of the Answerer and weak supervision labels, and then finetune the Selector to select key knowledge; repeat. Our framework significantly enhances the performance of the baseline on the challenging open-domain knowledge-based VQA benchmark OK-VQA, achieving a state-of-the-art accuracy of 62.83%. Our code is publicly available at this https URL.
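The alternating Selector/Answerer loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: in the paper both modules are LVLMs finetuned parameter-efficiently, whereas here they are replaced by simple keyword lookups so the control flow is runnable. All names (`Example`, `ToySelector`, `ToyAnswerer`, `pseudo_label`, `self_bootstrap`) are hypothetical, and the pseudo-labelling rule (a document is key if it contains the gold answer string and the Answerer predicts correctly from it alone) is an assumption about how predictions and weak supervision might be combined.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    answer: str
    docs: list  # documents retrieved for the question (hard-coded here, DPR in the paper)


class ToySelector:
    """Hypothetical stand-in for the LVLM Selector."""

    def __init__(self):
        self.key_docs = set()  # documents pseudo-labelled as key so far

    def select(self, ex, k=1):
        # Stable sort: documents marked as key come first, otherwise retrieval order.
        ranked = sorted(ex.docs, key=lambda d: d not in self.key_docs)
        return ranked[:k]

    def finetune(self, pseudo_labels):
        # "Finetuning" here just memorises which documents were labelled key.
        self.key_docs = {doc for doc, is_key in pseudo_labels if is_key}


class ToyAnswerer:
    """Hypothetical stand-in for the LVLM Answerer."""

    def __init__(self):
        self.vocab = set()  # answers seen during "finetuning"

    def finetune(self, batch):
        for ex, _selected_docs in batch:
            self.vocab.add(ex.answer)

    def predict(self, ex, docs):
        text = " ".join(docs).lower()
        for ans in sorted(self.vocab):
            if ans in text:
                return ans
        return "unknown"


def pseudo_label(answerer, ex):
    # Assumed rule: weak supervision (answer string appears in the document)
    # AND the Answerer is correct when given only that document.
    labels = []
    for doc in ex.docs:
        weak = ex.answer in doc.lower()
        correct = answerer.predict(ex, [doc]) == ex.answer
        labels.append((doc, weak and correct))
    return labels


def self_bootstrap(data, rounds=2, k=1):
    selector, answerer = ToySelector(), ToyAnswerer()
    for _ in range(rounds):
        # Step 1: Selector picks key knowledge; Answerer is finetuned on it.
        answerer.finetune([(ex, selector.select(ex, k)) for ex in data])
        # Step 2: pseudo-label documents, then finetune the Selector on them.
        labels = [pl for ex in data for pl in pseudo_label(answerer, ex)]
        selector.finetune(labels)
    return selector, answerer


data = [
    Example(
        question="What sport is this?",
        answer="tennis",
        docs=["Rackets are made of graphite.", "Tennis is played on a court."],
    )
]
selector, answerer = self_bootstrap(data)
chosen = selector.select(data[0])
prediction = answerer.predict(data[0], chosen)
```

In this toy run the Selector initially returns the first retrieved document, and after one round of pseudo-labelling it switches to the document that supports the answer, which is the behaviour the bootstrapping loop is meant to encourage.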

Comments:	Accepted to EMNLP 2024 Main Conference
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.13947 [cs.CV]
	(or arXiv:2404.13947v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.13947

Submission history

From: Dongze Hao
[v1] Mon, 22 Apr 2024 07:44:20 UTC (634 KB)
[v2] Sun, 16 Jun 2024 07:04:48 UTC (602 KB)
[v3] Tue, 8 Oct 2024 07:10:20 UTC (603 KB)
