Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning

Perry, Julian; Siripong, Surasakdi; Phonchai, Thanakorn

arXiv:2501.08597 (cs)

[Submitted on 15 Jan 2025]

Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal tasks, but their performance is often constrained by the lack of external knowledge integration, limiting their ability to handle knowledge-intensive tasks such as visual question answering and reasoning. To address this challenge, we propose a novel method, Adaptive Knowledge-Guided Pretraining for Large Vision-Language Models (AKGP-LVLM), which dynamically incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning. Our approach employs a knowledge encoder to represent external knowledge, a retrieval mechanism to select task-relevant information, and a dynamic adaptor to align multimodal and knowledge representations effectively. We evaluate our method on four benchmark datasets, demonstrating significant performance improvements over state-of-the-art models. Furthermore, human evaluations highlight the superior correctness and relevance of our model's outputs. Extensive analyses confirm the robustness, efficiency, and scalability of AKGP-LVLM, making it a compelling solution for real-world knowledge-intensive tasks.
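
The abstract names three components (a knowledge encoder, a retrieval mechanism, and a dynamic adaptor) without specifying how they are implemented. The toy sketch below illustrates one generic way a retrieve-then-fuse pipeline of this kind could look. Everything in it is an assumption for illustration, not the authors' method: knowledge entries are taken to be already encoded as plain vectors, retrieval is cosine-similarity top-k, and the "adaptor" is a single scalar sigmoid gate. The names `retrieve_top_k` and `gated_fuse` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, knowledge, k=2):
    """Assumed retrieval step: indices of the k knowledge vectors closest to the query."""
    ranked = sorted(
        range(len(knowledge)),
        key=lambda i: cosine(query, knowledge[i]),
        reverse=True,
    )
    return ranked[:k]


def gated_fuse(feature, retrieved, gate_weight=1.0, gate_bias=0.0):
    """Assumed adaptor step: blend a multimodal feature with the mean of the
    retrieved knowledge vectors, weighted by a scalar sigmoid gate."""
    if not retrieved:
        return list(feature)  # nothing retrieved: pass the feature through unchanged
    dim = len(feature)
    mean_k = [sum(v[d] for v in retrieved) / len(retrieved) for d in range(dim)]
    score = gate_weight * cosine(feature, mean_k) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-score))
    return [(1.0 - gate) * f + gate * m for f, m in zip(feature, mean_k)]


# Toy data: three pre-encoded knowledge vectors and one multimodal query feature.
knowledge = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]

top = retrieve_top_k(query, knowledge, k=2)
fused = gated_fuse(query, [knowledge[i] for i in top])
```

In this sketch the gate grows with the similarity between the feature and the retrieved knowledge, so weakly related knowledge contributes less to the fused representation. A trained system would presumably learn the gate parameters and operate on high-dimensional encoder outputs rather than hand-written vectors.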

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.08597 [cs.CL]
	(or arXiv:2501.08597v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.08597

Submission history

From: Surasakdi Siripong
[v1] Wed, 15 Jan 2025 05:45:04 UTC (31 KB)