Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering

Wang, Wenjin; Li, Yunhao; Ou, Yixin; Zhang, Yin

arXiv:2306.00526v3 (cs)

[Submitted on 1 Jun 2023 (v1), revised 6 Sep 2023 (this version, v3), latest version 7 Sep 2023 (v4)]

Abstract: The pre-training-fine-tuning paradigm based on layout-aware multimodal pre-trained models has achieved significant progress on document image question answering. However, domain pre-training and task fine-tuning for additional visual, layout, and task modules prevent these models from directly utilizing off-the-shelf instruction-tuning language foundation models, which have recently shown promising potential in zero-shot learning. Instead of aligning language models to the domain of document image question answering, we align document image question answering to off-the-shelf instruction-tuning language foundation models to utilize their zero-shot capability. Specifically, we propose a layout and task aware instruction prompt called LATIN-Prompt, which consists of layout-aware document content and task-aware descriptions. The former recovers the layout information among text segments from OCR tools by inserting appropriate spaces and line breaks. The latter ensures that the model generates answers that meet the requirements, especially format requirements, through a detailed description of the task. Experimental results on three benchmarks show that LATIN-Prompt can improve the zero-shot performance of instruction-tuning language foundation models on document image question answering and help them reach levels comparable to state-of-the-art models based on the pre-training-fine-tuning paradigm. Quantitative and qualitative analyses demonstrate the effectiveness of LATIN-Prompt. We provide the code in the supplementary material and will release it to facilitate future research.
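
The two prompt components described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the segment format `(text, x_left, y_top, x_right, y_bottom)`, the `char_width` and `line_tol` parameters, the function names, and the wording of the task description are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical OCR output: (text, x_left, y_top, x_right, y_bottom) in pixels.
Segment = Tuple[str, float, float, float, float]


def layout_aware_text(segments: List[Segment],
                      char_width: float = 10.0,
                      line_tol: float = 5.0) -> str:
    """Join OCR segments into plain text, using spaces and line breaks
    to approximate where each segment sat on the page."""
    # Reading order: top-to-bottom, then left-to-right.
    ordered = sorted(segments, key=lambda s: (s[2], s[1]))

    # Group segments into rows when their top edges are close together.
    rows: List[List[Segment]] = []
    for seg in ordered:
        if rows and abs(seg[2] - rows[-1][0][2]) <= line_tol:
            rows[-1].append(seg)
        else:
            rows.append([seg])

    lines = []
    for row in rows:
        row.sort(key=lambda s: s[1])
        line = ""
        for text, x_left, _, _, _ in row:
            # Map the pixel offset to a character column (assumed fixed width).
            col = int(round(x_left / char_width))
            # Pad to the target column; keep at least one space between segments.
            pad = max(col - len(line), 1 if line else 0)
            line += " " * pad + text
        lines.append(line)
    return "\n".join(lines)


def build_prompt(document: str, question: str) -> str:
    """Combine an illustrative task description with the layout-aware text."""
    # Assumed wording; the actual task descriptions are defined in the paper.
    task = ("Answer the question using only the document below. "
            "Reply with a short span copied from the document, without explanation.")
    return f"{task}\n\nDocument:\n{document}\n\nQuestion: {question}\nAnswer:"


segments = [
    ("Total", 0, 100, 50, 110),
    ("Name", 0, 0, 40, 10),
    ("Price", 200, 2, 250, 12),
    ("42", 200, 101, 220, 111),
]
document = layout_aware_text(segments)
prompt = build_prompt(document, "What is the total price?")
```

In this sketch, "Price" and "42" end up in the same character column, so a language model reading the plain text can still see that the two values belong to the same table column.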

Comments:	Add the LATIN-Tuning for Alpaca. Code is available at this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.00526 [cs.CL]
	(or arXiv:2306.00526v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.00526

Submission history

From: Wenjin Wang
[v1] Thu, 1 Jun 2023 10:28:12 UTC (544 KB)
[v2] Fri, 30 Jun 2023 12:03:58 UTC (549 KB)
[v3] Wed, 6 Sep 2023 03:30:14 UTC (3,836 KB)
[v4] Thu, 7 Sep 2023 08:40:16 UTC (3,836 KB)
