VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation

Dao, Alan; Buppodom, Norapat

Computer Science > Computer Vision and Pattern Recognition

arXiv:2503.21214 (cs)

[Submitted on 27 Mar 2025]

Authors: Alan Dao (Gia Tuan Dao), Norapat Buppodom

Abstract: Comprehending 3D environments is vital for intelligent systems in domains like robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging. This paper proposes a novel approach utilizing a Vision-Language Model (VLM) to extract "voxel semantics" (object identity, color, and location) from voxel data. Critically, instead of employing complex 3D networks, our method processes the voxel space by systematically slicing it along a primary axis (e.g., the Z-axis, analogous to CT scan slices). These 2D slices are then formatted and sequentially fed into the image encoder of a standard VLM. The model learns to aggregate information across slices and correlate spatial patterns with semantic concepts provided by the language component. This slice-based strategy aims to leverage the power of pre-trained 2D VLMs for efficient 3D semantic understanding directly from voxel representations.
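
The slicing step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the array layout `(X, Y, Z, 3)`, the convention that an all-zero voxel is empty, and the helper names `slice_voxel_grid` and `tile_slices` are all assumptions made for illustration. The abstract does not say how slices are "formatted" before reaching the image encoder, so the tiling shown here is only one possible choice.

```python
import numpy as np


def slice_voxel_grid(voxels: np.ndarray, axis: int = 2) -> list:
    """Split a voxel grid into 2D slices along one axis (Z by default).

    Assumed layout (hypothetical, not from the paper): `voxels` has shape
    (X, Y, Z, 3) and holds uint8 RGB colors, with all-zero meaning empty.
    """
    if voxels.ndim != 4 or voxels.shape[-1] != 3:
        raise ValueError("expected an array of shape (X, Y, Z, 3)")
    if axis not in (0, 1, 2):
        raise ValueError("axis must be 0, 1, or 2")
    # Move the slicing axis to the front so each leading index is one slice.
    moved = np.moveaxis(voxels, axis, 0)
    return [np.ascontiguousarray(s) for s in moved]


def tile_slices(slices: list, columns: int) -> np.ndarray:
    """Arrange equally sized slices into one row-major image grid.

    This is one assumed way to format slices as a single 2D image.
    """
    height, width, channels = slices[0].shape
    rows = -(-len(slices) // columns)  # ceiling division
    canvas = np.zeros((rows * height, columns * width, channels),
                      dtype=slices[0].dtype)
    for index, image in enumerate(slices):
        row, col = divmod(index, columns)
        canvas[row * height:(row + 1) * height,
               col * width:(col + 1) * width] = image
    return canvas


# Toy example: a 4x4x4 grid with one red voxel at (x=1, y=2, z=3).
grid = np.zeros((4, 4, 4, 3), dtype=np.uint8)
grid[1, 2, 3] = (255, 0, 0)

slices = slice_voxel_grid(grid, axis=2)   # four 4x4 RGB slices
sheet = tile_slices(slices, columns=2)    # one 8x8 RGB image
```

In this toy example the red voxel appears only in the last slice, which lands in the bottom-right tile of `sheet`. Feeding slices one at a time, rather than as a tiled sheet, would match the "sequentially fed" wording equally well.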

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2503.21214 [cs.CV]
	(or arXiv:2503.21214v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.21214

Submission history

From: Alan Dao
[v1] Thu, 27 Mar 2025 07:07:11 UTC (2,250 KB)
