Prompt-CAM: A Simpler Interpretable Transformer for Fine-Grained Analysis

Chowdhury, Arpita; Paul, Dipanjyoti; Mai, Zheda; Gu, Jianyang; Zhang, Ziheng; Mehrab, Kazi Sajeed; Campolongo, Elizabeth G.; Rubenstein, Daniel; Stewart, Charles V.; Karpatne, Anuj; Berger-Wolf, Tanya; Su, Yu; Chao, Wei-Lun

arXiv:2501.09333 (cs)

[Submitted on 16 Jan 2025]

Abstract: We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as different bird species or dog breeds. Pre-trained ViTs such as DINO have shown remarkable capabilities to extract localized, informative features. However, saliency methods like Grad-CAM can hardly point out the traits: they often locate the whole object with a blurred, coarse heatmap rather than the traits themselves. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to the rescue. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes' images, i.e., traits. As such, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch: it simply modifies the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, in sharp contrast to other interpretable methods that design specific models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate Prompt-CAM's superior interpretation capability.
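
The mechanism described in the abstract (one prompt per class, prompt outputs used for classification, the winning prompt's attention read out as a trait map) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, hand-picked 3-dimensional patch features, and a hypothetical shared scoring vector `score_vec`; all function and variable names are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def prompt_attention(prompt, patches):
    """Attention weights of one class prompt over all patch tokens."""
    scale = math.sqrt(len(prompt))
    return softmax([dot(prompt, p) / scale for p in patches])

def classify_with_prompts(prompts, patches, score_vec):
    """Score each class from its own prompt's attended output.

    Returns the predicted class index, the per-class scores, and the
    per-class attention maps (one weight per patch).
    """
    dim = len(patches[0])
    attn_maps = [prompt_attention(q, patches) for q in prompts]
    # Each prompt's output is the attention-weighted sum of patch features.
    outputs = [
        [sum(a * p[d] for a, p in zip(attn, patches)) for d in range(dim)]
        for attn in attn_maps
    ]
    # Assumption: one scoring vector shared by all classes, so a class can
    # only score high if its own prompt attends to informative patches.
    scores = [dot(score_vec, out) for out in outputs]
    pred = max(range(len(scores)), key=scores.__getitem__)
    return pred, scores, attn_maps

# Toy image: four patch tokens; only patch 2 carries a class-0 trait.
patches = [
    [1.0, 0.0, 0.0],  # background
    [1.0, 0.0, 0.0],  # background
    [0.0, 4.0, 0.0],  # trait patch (feature 1 is active)
    [1.0, 0.0, 0.0],  # background
]
# Hand-set "learned" prompts: class 0 looks for feature 1, class 1 for feature 2.
prompts = [
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
]
score_vec = [0.0, 1.0, 1.0]

pred, scores, attn_maps = classify_with_prompts(prompts, patches, score_vec)
trait_patch = max(range(len(patches)), key=attn_maps[pred].__getitem__)
print("predicted class:", pred)
print("most-attended patch for that class:", trait_patch)
```

In this toy setting the class-0 prompt concentrates its attention on the single trait patch, while the class-1 prompt finds nothing matching and spreads its attention uniformly. Reading out the predicted class's attention map is what localizes the trait.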

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.09333 [cs.CV]
	(or arXiv:2501.09333v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.09333

Submission history

From: Wei-Lun Chao
[v1] Thu, 16 Jan 2025 07:07:41 UTC (7,390 KB)
