Visual Position Prompt for MLLM based Visual Grounding

Tang, Wei; Sun, Yanpeng; Gu, Qinying; Li, Zechao


arXiv:2503.15426v1 (cs)

[Submitted on 19 Mar 2025 (this version), latest version 24 Mar 2025 (v2)]

Abstract: Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they struggle to align coordinates precisely with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address this issue, we introduce VPP-LLaVA, an MLLM equipped with a Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues. The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggest probable object locations. We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training. Training on this dataset with VPP enhances the model's performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples than other MLLMs such as MiniGPT-v2, which rely on much larger datasets ($\sim$21M samples). The code and VPP-SFT dataset will be available at this https URL upon acceptance.
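
To make the global VPP idea concrete, the following is a minimal sketch of overlaying an axis-like prompt on an image. It is not the authors' implementation: the function names `make_axis_prompt` and `overlay_prompt`, the tick spacing, the additive blending rule, and the weight `alpha` are all assumptions chosen for illustration, and the image is a single-channel nested list rather than a real tensor. In the paper the prompt is learnable, so in a real model it would be a trainable parameter updated during fine-tuning; here it is only initialized.

```python
def make_axis_prompt(height, width, tick_every=4):
    """Build an axis-like prompt (hypothetical initialization).

    The top row and left column act as horizontal and vertical axes.
    Tick positions get a stronger value (1.0) than the rest of the
    axis (0.5); everything off the axes is zero.
    """
    prompt = [[0.0] * width for _ in range(height)]
    for x in range(width):   # horizontal axis along the top row
        prompt[0][x] = 1.0 if x % tick_every == 0 else 0.5
    for y in range(height):  # vertical axis along the left column
        prompt[y][0] = 1.0 if y % tick_every == 0 else 0.5
    return prompt


def overlay_prompt(image, prompt, alpha=0.5):
    """Overlay the prompt on the image by weighted addition.

    The additive rule and the value of alpha are assumptions; the
    abstract only states that the embeddings are overlaid on the image.
    """
    return [
        [pixel + alpha * p for pixel, p in zip(image_row, prompt_row)]
        for image_row, prompt_row in zip(image, prompt)
    ]


# Usage: an all-black 8x8 single-channel image.
image = [[0.0] * 8 for _ in range(8)]
prompt = make_axis_prompt(8, 8)
prompted = overlay_prompt(image, prompt)
```

After the overlay, only the top row and left column of `prompted` differ from the input, which is the sense in which the prompt supplies a spatial reference without altering the image content elsewhere.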

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.15426 [cs.CV]
	(or arXiv:2503.15426v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.15426

Submission history

From: Wei Tang
[v1] Wed, 19 Mar 2025 17:08:13 UTC (7,637 KB)
[v2] Mon, 24 Mar 2025 16:34:55 UTC (7,637 KB)
