Hierarchical Separable Video Transformer for Snapshot Compressive Imaging

Wang, Ping; Zhang, Yulun; Wang, Lishun; Yuan, Xin

arXiv:2407.11946 (cs)

[Submitted on 16 Jul 2024 (v1), last revised 17 Jul 2024 (this version, v2)]

Abstract: Transformers have achieved state-of-the-art performance in solving the inverse problem of video Snapshot Compressive Imaging (SCI), whose ill-posedness is rooted in the mixed degradation of spatial masking and temporal aliasing. However, previous Transformers lack insight into this degradation, which limits their performance and efficiency. In this work, we tailor an efficient reconstruction architecture that omits temporal aggregation in early layers and uses the Hierarchical Separable Video Transformer (HiSViT) as its building block. HiSViT is built from multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN) with dense connections, each group operating on a separate portion of the channels at a different scale, for multi-scale interactions and long-range modeling. By separating spatial operations from temporal ones, CSS-MSA introduces an inductive bias of paying more attention within frames than between frames while reducing computational overhead. GSM-FFN further enhances locality via a gating mechanism and factorized spatial-temporal convolutions. Extensive experiments demonstrate that our method outperforms previous methods by $>0.5$ dB with comparable or fewer parameters and complexity. The source code and pretrained models are released at this https URL.
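
The "mixed degradation of spatial masking and temporal aliasing" mentioned in the abstract can be illustrated with the measurement model commonly used for video SCI, in which each frame is multiplied elementwise by its own mask and all masked frames are summed into one 2D snapshot. The following is a minimal sketch under that assumption, not the authors' code: the function name `sci_forward`, the binary random masks, and the array sizes are all hypothetical choices made for illustration, and sensor noise is ignored.

```python
import numpy as np


def sci_forward(frames, masks):
    """Collapse T masked frames into a single 2D snapshot.

    frames, masks: arrays of shape (T, H, W).
    Returns y with y[h, w] = sum over t of masks[t, h, w] * frames[t, h, w].
    """
    if frames.shape != masks.shape:
        raise ValueError("frames and masks must share the shape (T, H, W)")
    # Spatial masking: elementwise product with a per-frame mask.
    # Temporal aliasing: summation over the time axis.
    return (masks * frames).sum(axis=0)


# Hypothetical toy sizes: 8 frames of 4 x 4 pixels.
rng = np.random.default_rng(0)
T, H, W = 8, 4, 4
frames = rng.random((T, H, W))
masks = rng.integers(0, 2, size=(T, H, W)).astype(frames.dtype)

snapshot = sci_forward(frames, masks)
print(snapshot.shape)  # (4, 4)
```

Reconstruction has to recover all T frames from this single snapshot and the known masks, so the problem has T times more unknowns than measurements, which is why it is ill-posed.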

Comments:	Accepted by ECCV 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.11946 [cs.CV]
	(or arXiv:2407.11946v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.11946

Submission history

From: Ping Wang
[v1] Tue, 16 Jul 2024 17:35:59 UTC (7,936 KB)
[v2] Wed, 17 Jul 2024 08:07:58 UTC (9,273 KB)
