LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

Huang, Haiwen; Chen, Anpei; Havrylov, Volodymyr; Geiger, Andreas; Zhang, Dan

arXiv:2504.14032 (cs)

[Submitted on 18 Apr 2025]

Abstract: Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at this https URL.
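
The general idea of coordinate-based cross-attention upsampling mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' model: it has no learned projections, no training, and a single attention step. Every name here (`encode_coords`, `cross_attention_upsample`), the sin/cos coordinate encoding, and the use of mean patch color in the keys are assumptions made purely for illustration. Each high-resolution pixel forms a query from its color and encoded coordinates, attends over the low-resolution feature cells, and receives a weighted mix of their features.

```python
import math

def encode_coords(x, y, num_freqs=4):
    """Sin/cos encoding of normalized coordinates x, y in [0, 1] (an assumed encoding)."""
    out = []
    for k in range(num_freqs):
        freq = (2 ** k) * math.pi
        out += [math.sin(freq * x), math.cos(freq * x),
                math.sin(freq * y), math.cos(freq * y)]
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_upsample(image, lowres_feats):
    """Toy, untrained cross-attention upsampler (illustrative only).

    image:        H x W nested list of per-pixel channel lists.
    lowres_feats: h x w nested list of D-dimensional feature lists.
    Assumes H and W are multiples of h and w.
    Returns an H x W nested list of D-dimensional features.
    """
    H, W = len(image), len(image[0])
    h, w = len(lowres_feats), len(lowres_feats[0])
    channels = len(image[0][0])

    # Keys: mean color of the image patch under each low-res cell, plus the
    # encoded cell-center coordinates. Values: the low-res features themselves.
    keys, values = [], []
    for i in range(h):
        for j in range(w):
            rows = range(i * H // h, (i + 1) * H // h)
            cols = range(j * W // w, (j + 1) * W // w)
            patch = [image[r][c] for r in rows for c in cols]
            mean_color = [sum(p[ch] for p in patch) / len(patch)
                          for ch in range(channels)]
            keys.append(mean_color + encode_coords((j + 0.5) / w, (i + 0.5) / h))
            values.append(lowres_feats[i][j])

    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    output = []
    for r in range(H):
        row = []
        for c in range(W):
            # Query: pixel color plus encoded pixel-center coordinates.
            query = image[r][c] + encode_coords((c + 0.5) / W, (r + 0.5) / H)
            scores = [sum(q * k for q, k in zip(query, key)) / scale
                      for key in keys]
            weights = softmax(scores)
            # Output feature: attention-weighted sum of low-res features.
            row.append([sum(wt * v[d] for wt, v in zip(weights, values))
                        for d in range(dim)])
        output.append(row)
    return output

# Usage: upsample a 2 x 2 grid of 2-D features to a 4 x 4 image grid.
image = [[[r / 3, c / 3, 0.5] for c in range(4)] for r in range(4)]
lowres = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
hires = cross_attention_upsample(image, lowres)
print(len(hires), len(hires[0]), len(hires[0][0]))  # 4 4 2
```

Because the attention weights are tied to coordinates rather than to a fixed grid, the same function accepts any image size that is a multiple of the feature grid, which loosely mirrors the resolution flexibility the abstract describes. A learned model would replace the identity projections used here with trained query, key, and value mappings.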

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Cite as:	arXiv:2504.14032 [cs.CV]
	(or arXiv:2504.14032v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.14032

Submission history

From: Haiwen Huang
[v1] Fri, 18 Apr 2025 18:46:08 UTC (36,746 KB)
