ViT-VS: On the Applicability of Pretrained Vision Transformer Features for Generalizable Visual Servoing

Scherl, Alessandro; Thalhammer, Stefan; Neuberger, Bernhard; Wöber, Wilfried; Gracía-Rodríguez, José

arXiv:2503.04545 (cs)

[Submitted on 6 Mar 2025]

Abstract: Visual servoing enables robots to precisely position their end-effector relative to a target object. Classical methods rely on hand-crafted features and are therefore universally applicable without task-specific training, but they often struggle with occlusions and environmental variations. Learning-based approaches improve robustness but typically require extensive training. We present a visual servoing approach that leverages pretrained vision transformers for semantic feature extraction, combining the advantages of both paradigms while also generalizing beyond the provided sample. Our approach achieves full convergence in unperturbed scenarios and outperforms classical image-based visual servoing by up to 31.2% (relative improvement) in perturbed scenarios. It also matches the convergence rates of learning-based methods despite requiring no task- or object-specific training. Real-world evaluations confirm robust performance in end-effector positioning, industrial box manipulation, and grasping of unseen objects using only a reference from the same category. Our code and simulation environment are publicly available.

Subjects:	Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.04545 [cs.RO]
	(or arXiv:2503.04545v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2503.04545

Submission history

From: Alessandro Scherl
[v1] Thu, 6 Mar 2025 15:33:19 UTC (3,969 KB)
