A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation

Zhang, Xin; Han, Liangxiu; Sobeih, Tam; Han, Lianghao; Dancey, Darren

arXiv:2404.17335v2 (cs)

[Submitted on 26 Apr 2024 (v1), revised 1 May 2024 (this version, v2), latest version 24 Feb 2025 (v3)]

Abstract: Depth estimation is crucial for interpreting complex environments, especially in areas such as autonomous vehicle navigation and robotics. Nonetheless, obtaining accurate depth readings from event camera data remains a formidable challenge. Event cameras operate differently from traditional digital cameras: they capture data continuously and generate asynchronous binary spikes that encode time, location, and light intensity. These unique sampling mechanisms render standard image-based algorithms inadequate for processing spike data. This necessitates innovative, spike-aware algorithms tailored for event cameras, a task compounded by the irregularity, continuity, noise, and spatial and temporal characteristics inherent in spiking data. Given the strong generalization capabilities of transformer neural networks for spatiotemporal data, we propose a purely spike-driven spike transformer network for depth estimation from spiking camera data. To address the performance limitations of spiking neural networks (SNNs), we introduce a novel single-stage cross-modality knowledge transfer framework that leverages knowledge from a large vision foundation model based on artificial neural networks (ANNs), DINOv2, to enhance the performance of SNNs with limited data. Our experimental results on both synthetic and real datasets show substantial improvements over existing models, with notable gains in Absolute Relative and Square Relative errors (49% and 39.77% improvements over the benchmark model Spike-T, respectively). Besides accuracy, the proposed model also demonstrates reduced power consumption, a critical factor for practical applications.
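
For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of the definitions commonly used in depth-estimation benchmarks, together with the arithmetic behind a "percent improvement" figure. The abstract does not state the evaluation protocol, so the validity mask (`gt > 0`), the function names, and the toy values below are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only: common definitions of the Absolute Relative and
# Square Relative depth errors. Masking rule and names are assumptions.

def _valid_pairs(pred, gt):
    """Keep only pixels with a positive ground-truth depth (assumed mask)."""
    return [(p, g) for p, g in zip(pred, gt) if g > 0]

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

def sq_rel(pred, gt):
    """Mean of (pred - gt)^2 / gt over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum((p - g) ** 2 / g for p, g in pairs) / len(pairs)

def relative_improvement(baseline_error, new_error):
    """Fractional error reduction relative to a baseline (lower error is better)."""
    return (baseline_error - new_error) / baseline_error

# Toy depth values (hypothetical, not from the paper).
pred = [2.0, 4.0, 7.0]
gt = [1.0, 4.0, 0.0]  # the last pixel has no valid ground truth and is skipped

print(abs_rel(pred, gt))
print(sq_rel(pred, gt))
print(relative_improvement(4.0, 2.0))
```

Under these definitions, a 49% improvement in Absolute Relative error means the new error is 51% of the baseline's error on the same data.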

Comments:	16 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2404.17335 [cs.CV]
	(or arXiv:2404.17335v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.17335

Submission history

From: Xin Zhang
[v1] Fri, 26 Apr 2024 11:32:53 UTC (5,316 KB)
[v2] Wed, 1 May 2024 08:54:54 UTC (5,319 KB)
[v3] Mon, 24 Feb 2025 10:47:58 UTC (5,011 KB)
