Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis

Patro, Badri N.; Ranganath, Suhas; Namboodiri, Vinay P.; Agneeswaran, Vijay S.

arXiv:2403.18063 (cs)

[Submitted on 26 Mar 2024 (v1), last revised 3 Jun 2024 (this version, v2)]

Abstract: Transformers have revolutionized image modeling tasks with adaptations like DeIT, Swin, SVT, Biformer, STVit, and FDVIT. However, these models often face challenges with inductive bias and quadratic complexity, making them less efficient for high-resolution images. State space models (SSMs) such as Mamba, V-Mamba, ViM, and SiMBA offer an alternative for handling high-resolution images in computer vision tasks. These SSMs encounter two major issues. First, they become unstable when scaled to large network sizes. Second, although they efficiently capture global information in images, they inherently struggle with local information. To address these challenges, we introduce Heracles, a novel SSM that integrates a local SSM, a global SSM, and an attention-based token interaction module. Heracles uses a Hartley kernel-based state space model for global image information, a localized convolutional network for local details, and attention mechanisms in deeper layers for token interactions. Our extensive experiments demonstrate that Heracles-C-Small achieves state-of-the-art performance on the ImageNet dataset with 84.5% top-1 accuracy. Heracles-C-Large and Heracles-C-Huge further improve accuracy to 85.9% and 86.4%, respectively. Additionally, Heracles excels in transfer learning tasks on datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars, and in instance segmentation on the MSCOCO dataset. Heracles also proves its versatility by achieving state-of-the-art results on seven time-series datasets, showcasing its ability to generalize across domains with spectral data by capturing both local and global information.
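
The abstract pairs a Hartley-domain branch for global information with a convolutional branch for local detail. The following is a minimal NumPy sketch of that general idea only, not the authors' implementation: the function names (`dht`, `idht`, `global_branch`, `local_branch`, `hybrid_mix`), the per-frequency gain, the three-tap kernel, and the choice to simply add the two branches are all assumptions made for illustration. The attention module used in deeper layers is not covered.

```python
import numpy as np

def dht(x, axis=0):
    """Discrete Hartley transform computed from the FFT: H = Re(F) - Im(F)."""
    f = np.fft.fft(x, axis=axis)
    return f.real - f.imag

def idht(h, axis=0):
    """The DHT is its own inverse up to a 1/N factor."""
    return dht(h, axis=axis) / h.shape[axis]

def global_branch(tokens, freq_gain):
    """Scale each Hartley coefficient, then transform back.

    Every output token depends on every input token, so this is a global mixer.
    freq_gain stands in for learnable weights (hypothetical).
    """
    return idht(dht(tokens) * freq_gain)

def local_branch(tokens, kernel):
    """Depthwise sliding weighted sum along the token axis (zero padded).

    Assumes an odd-length kernel shared across channels (hypothetical).
    """
    pad = len(kernel) // 2
    padded = np.pad(tokens, ((pad, pad), (0, 0)))
    n = tokens.shape[0]
    return sum(w * padded[i:i + n] for i, w in enumerate(kernel))

def hybrid_mix(tokens, freq_gain, kernel):
    """Combine global and local branches by addition (an assumed design)."""
    return global_branch(tokens, freq_gain) + local_branch(tokens, kernel)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))   # 16 tokens, 8 channels
freq_gain = np.ones((16, 1))            # identity gain in the Hartley domain
kernel = np.array([0.25, 0.5, 0.25])    # small smoothing kernel
out = hybrid_mix(tokens, freq_gain, kernel)
```

With a gain of all ones the global branch returns its input unchanged, which is a convenient sanity check that the transform pair is implemented correctly. A real-valued transform such as the Hartley transform keeps the gain real as well, which is one plausible reason to prefer it over a complex Fourier filter.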

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2403.18063 [cs.CV]
	(or arXiv:2403.18063v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.18063

Submission history

From: Badri Narayana Patro
[v1] Tue, 26 Mar 2024 19:29:21 UTC (3,137 KB)
[v2] Mon, 3 Jun 2024 18:22:30 UTC (4,400 KB)