HORT: Monocular Hand-held Objects Reconstruction with Transformers

Chen, Zerui; Potamias, Rolandos Alexandros; Chen, Shizhe; Schmid, Cordelia


arXiv:2503.21313 (cs)

[Submitted on 27 Mar 2025]



Abstract: Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and make it time-consuming to extract explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and then progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
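The coarse-to-fine step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`project_points`, `sample_pixel_aligned`, `upsample`), the pinhole intrinsics `K`, the upsampling ratio, and the random linear map `W_off` standing in for the transformer are all assumptions made for illustration. It only shows the general idea of projecting sparse 3D points into the image, sampling features at those pixels, and splitting each point into several offset children.

```python
import numpy as np

def project_points(points, K):
    """Project (N, 3) camera-space points to (N, 2) pixel coordinates (pinhole model)."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def sample_pixel_aligned(feat, uv):
    """Bilinearly sample an (H, W, C) feature map at (N, 2) pixel coordinates (u = x, v = y)."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv

def upsample(points, feats, ratio, W_off):
    """Split each point into `ratio` children by adding offsets predicted from its feature.

    W_off is a hypothetical (C, ratio * 3) linear map standing in for a learned network.
    """
    offsets = (feats @ W_off).reshape(len(points), ratio, 3)
    return (points[:, None, :] + offsets).reshape(-1, 3)

rng = np.random.default_rng(0)
K = np.array([[10.0, 0.0, 4.0], [0.0, 10.0, 4.0], [0.0, 0.0, 1.0]])  # assumed intrinsics
feat = rng.normal(size=(8, 8, 16))                                   # toy image feature map
sparse = np.column_stack([rng.uniform(-0.2, 0.2, (32, 2)), np.ones(32)])  # 32 coarse points at depth 1
feats = sample_pixel_aligned(feat, project_points(sparse, K))
W_off = 0.01 * rng.normal(size=(16, 4 * 3))
dense = upsample(sparse, feats, ratio=4, W_off=W_off)
print(dense.shape)  # (128, 3)
```

Repeating the last three lines on `dense` would give a further refinement stage, which is the sense in which the sketch is "progressive"; a real model would replace `W_off` with learned layers and would also condition on hand geometry.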

Comments:	Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.21313 [cs.CV]
	(or arXiv:2503.21313v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.21313

Submission history

From: Zerui Chen
[v1] Thu, 27 Mar 2025 09:45:09 UTC (11,153 KB)
