Contextual Knowledge Pursuit for Faithful Visual Synthesis

Luo, Jinqi; Chan, Kwan Ho Ryan; Dimos, Dimitris; Vidal, René

arXiv:2311.17898 (cs)

[Submitted on 29 Nov 2023 (v1), last revised 5 Nov 2024 (this version, v3)]

Abstract: Modern text-to-vision generative models often hallucinate when the prompt describing the scene to be generated is underspecified. In large language models (LLMs), a prevalent strategy to reduce hallucinations is to retrieve factual knowledge from an external database. While such retrieval augmentation strategies have great potential to enhance text-to-vision generators, existing static top-K retrieval methods explore the knowledge pool once, missing the broader context necessary for high-quality generation. Furthermore, LLMs internally possess rich world knowledge learned during large-scale training (parametric knowledge) that could mitigate the need for external data retrieval. This paper proposes Contextual Knowledge Pursuit (CKPT), a framework that leverages the complementary strengths of external and parametric knowledge to help generators produce reliable visual content. Instead of the one-time retrieval of facts from an external database to improve a given prompt, CKPT uses (1) an LLM to decide whether to seek external knowledge or to self-elicit descriptions from LLM parametric knowledge, (2) a knowledge pursuit process to contextually seek and sequentially gather the most relevant facts, (3) a knowledge aggregator for prompt enhancement with the gathered fact context, and (4) a filtered fine-tuning objective to improve visual synthesis with richer prompts. We evaluate CKPT across multiple text-driven generative tasks (image, 3D rendering, and video) on datasets of rare objects and daily scenarios. Our results show that CKPT is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising data source for zero-shot synthesis and filtered fine-tuning of text-to-vision generative models.
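
The contrast the abstract draws between static top-K retrieval and sequential, context-aware fact gathering can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words cosine score, the greedy loop, and the names `static_top_k` and `sequential_pursuit` are all assumptions chosen only to show how conditioning each retrieval step on previously gathered facts can change which facts are selected.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: a Counter over lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counters (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def static_top_k(query, facts, k):
    """One-shot retrieval: score every fact against the query alone."""
    q = bow(query)
    return sorted(facts, key=lambda f: cosine(q, bow(f)), reverse=True)[:k]


def sequential_pursuit(query, facts, k):
    """Greedy retrieval (hypothetical stand-in for a pursuit process):
    each step scores the remaining facts against the query plus all
    facts gathered so far, then appends the best one to the context."""
    context, remaining, gathered = query, list(facts), []
    for _ in range(min(k, len(remaining))):
        c = bow(context)
        best = max(remaining, key=lambda f: cosine(c, bow(f)))
        gathered.append(best)
        remaining.remove(best)
        context = context + " " + best  # later steps see earlier facts
    return gathered


if __name__ == "__main__":
    # Toy knowledge pool, invented for illustration only.
    facts = [
        "quokka is a small marsupial",
        "a small marsupial has a pouch",
        "quokka merchandise sells well online today",
    ]
    print(static_top_k("quokka", facts, 2))
    print(sequential_pursuit("quokka", facts, 2))
```

In this toy pool, the second fact shares no word with the query, so a query-only ranking cannot reach it; once the first fact is part of the context, it becomes the best next match. A real system would presumably use learned embeddings or an LLM-based relevance score rather than word overlap.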

Comments:	Accepted in ECCV 2024 SDCV Workshop. GitHub repository at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2311.17898 [cs.CV]
	(or arXiv:2311.17898v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.17898

Submission history

From: Jinqi Luo
[v1] Wed, 29 Nov 2023 18:51:46 UTC (15,397 KB)
[v2] Thu, 30 Nov 2023 18:59:01 UTC (15,397 KB)
[v3] Tue, 5 Nov 2024 16:31:24 UTC (15,370 KB)
