2-D SSM: A General Spatial Layer for Visual Transformers

Baron, Ethan; Zimerman, Itamar; Wolf, Lior

arXiv:2306.06635 (cs)

[Submitted on 11 Jun 2023]

Abstract: A central objective in computer vision is to design models with appropriate 2-D inductive bias. Desiderata for 2-D inductive bias include two-dimensional position awareness, dynamic spatial locality, and translation and permutation invariance. To address these goals, we leverage an expressive variation of the multidimensional State Space Model (SSM). Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme. Empirically, we observe that incorporating our layer at the beginning of each transformer block of Vision Transformers (ViT) significantly enhances performance for multiple ViT backbones and across datasets. The new layer is effective even with a negligible increase in parameters and inference time. Ablation studies and visualizations demonstrate that the layer has a strong 2-D inductive bias. For example, vision transformers equipped with our layer perform effectively even without positional encoding.
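
As a rough illustration of what a two-dimensional state space recurrence can look like, the sketch below runs a naive linear scan over a grid in which each state depends on its left and upper neighbours. This is a simplified, single-state toy under stated assumptions: the function name `ssm_2d_scan`, the scalar parameters `a_h`, `a_v`, `b`, `c`, and the zero boundary condition are all hypothetical, and none of it reflects the parameterization, accelerated computation, or normalization scheme described in the abstract.

```python
import numpy as np


def ssm_2d_scan(u, a_h, a_v, b, c):
    """Naive 2-D linear recurrence over an H x W grid (illustrative only).

    x[i, j] = a_h * x[i, j-1] + a_v * x[i-1, j] + b * u[i, j]
    y[i, j] = c * x[i, j]

    States outside the grid are treated as zero (an assumption).
    """
    H, W = u.shape
    x = np.zeros((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            left = x[i, j - 1] if j > 0 else 0.0  # horizontal neighbour
            up = x[i - 1, j] if i > 0 else 0.0    # vertical neighbour
            x[i, j] = a_h * left + a_v * up + b * u[i, j]
    return c * x


# Impulse response: a single non-zero input at the top-left corner.
u = np.zeros((3, 3))
u[0, 0] = 1.0
y = ssm_2d_scan(u, a_h=0.5, a_v=0.25, b=1.0, c=1.0)
```

With different decay rates along the two axes, the response to the impulse falls off at different speeds horizontally and vertically, which gives an intuition for how such a recurrence can encode position along both dimensions without an explicit positional encoding.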

Comments:	16 pages, 5 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
MSC classes:	F.2.2, I.2.7
Cite as:	arXiv:2306.06635 [cs.CV]
	(or arXiv:2306.06635v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.06635

Submission history

From: Itamar Zimerman [view email]
[v1] Sun, 11 Jun 2023 09:41:37 UTC (9,071 KB)
