Perception Tokens Enhance Visual Reasoning in Multimodal Language Models

Bigverdi, Mahtab; Luo, Zelun; Hsieh, Cheng-Yu; Shen, Ethan; Chen, Dongping; Shapiro, Linda G.; Krishna, Ranjay


arXiv:2412.03548 (cs)

[Submitted on 4 Dec 2024 (v1), last revised 8 Dec 2024 (this version, v2)]

Abstract: Multimodal language models (MLMs) still face challenges in fundamental visual perception tasks where specialized models excel. Tasks requiring reasoning about 3D structures benefit from depth estimation, and reasoning about 2D object instances benefits from object detection. Yet MLMs cannot produce intermediate depth maps or bounding boxes to reason over. Finetuning MLMs on relevant data doesn't generalize well, and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with perception tokens can reason by generating a depth map as tokens, enabling it to solve the problem effectively. We propose AURORA, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. AURORA leverages a VQVAE to transform intermediate image representations, such as depth maps, into a tokenized format, alongside bounding box tokens, which are then used in a multi-task training framework. AURORA achieves notable improvements across counting benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench, outperforming finetuning approaches in generalization across datasets. It also improves on relative depth: over +6% on BLINK. With perception tokens, AURORA expands the scope of MLMs beyond language-based reasoning, paving the way for more effective visual reasoning capabilities.
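
The abstract describes turning a depth map into discrete tokens with a VQVAE. The following is a minimal sketch of only the vector-quantization lookup at the core of that idea, not AURORA's actual implementation. It assumes a fixed codebook applied directly to raw depth patches, whereas a real VQVAE learns both an encoder and the codebook. The function names, the patch size, and the `<DEPTH_…>` token strings are hypothetical and chosen for illustration.

```python
import numpy as np


def quantize_depth_map(depth, codebook, patch=4):
    """Return, for each patch x patch block, the index of the nearest codebook vector.

    Assumption: the codebook has shape (K, patch * patch) and is applied to raw
    depth values; a trained VQVAE would quantize learned encoder features instead.
    """
    h, w = depth.shape
    if h % patch or w % patch:
        raise ValueError("depth map sides must be multiples of the patch size")
    # Split into non-overlapping blocks, ordered row by row, and flatten each one.
    blocks = depth.reshape(h // patch, patch, w // patch, patch).transpose(0, 2, 1, 3)
    vectors = blocks.reshape(-1, patch * patch)
    # Squared Euclidean distance from every block to every codebook entry.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def to_perception_tokens(indices):
    """Wrap codebook indices in hypothetical special tokens an MLM could emit."""
    body = [f"<DEPTH_{int(i)}>" for i in indices]
    return ["<DEPTH_START>", *body, "<DEPTH_END>"]


if __name__ == "__main__":
    # Toy 8x8 depth map: the top-left quadrant is "far" (1.0), the rest "near" (0.0).
    depth = np.zeros((8, 8))
    depth[:4, :4] = 1.0
    # Toy two-entry codebook: entry 0 is all zeros, entry 1 is all ones.
    codebook = np.stack([np.zeros(16), np.ones(16)])
    print(to_perception_tokens(quantize_depth_map(depth, codebook)))
```

In this toy setting each 4x4 block maps to one token, so the 8x8 map becomes four depth tokens between the start and end markers. A sequence of this kind is what a language model could generate as an intermediate reasoning step before answering a depth question.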

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2412.03548 [cs.CV]
	(or arXiv:2412.03548v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.03548

Submission history

From: Mahtab Bigverdi
[v1] Wed, 4 Dec 2024 18:45:35 UTC (7,708 KB)
[v2] Sun, 8 Dec 2024 05:18:30 UTC (7,707 KB)
