A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features

Karanfil, Enes; Imamoglu, Nevrez; Erdem, Erkut; Erdem, Aykut


arXiv:2501.10144 (cs)

[Submitted on 17 Jan 2025]


Abstract: Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as various land use areas or coastal regions, which may also include snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the BigEarthNet v2 dataset from Sentinel-2, we establish a baseline with RGB-based scene descriptions and further demonstrate substantial improvements through the incorporation of multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification using linear probing and language modeling for jointly performing scene classification and description generation. Our results highlight Spectral LLaVA's ability to produce detailed and accurate descriptions, particularly for scenarios where RGB data alone proves inadequate, while also enhancing classification performance by refining SpectralGPT features into semantically meaningful representations.
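The abstract's training setup (a frozen vision backbone with only a lightweight linear projection being optimized) can be illustrated with a toy sketch. This is not the authors' implementation: the backbone here is a fixed random map standing in for SpectralGPT, the objective is a simple mean squared error toward synthetic targets rather than a language-modeling loss, and every name and size (`FrozenBackbone`, `LinearProjection`, `train_projection`, `N_BANDS`, `FEAT_DIM`, `EMBED_DIM`) is a hypothetical choice made for illustration.

```python
import copy
import math
import random

N_BANDS = 12    # assumed band count for a multispectral patch; illustrative only
FEAT_DIM = 8    # hypothetical backbone feature width
EMBED_DIM = 4   # hypothetical language-embedding width


def make_matrix(rows, cols, rng, scale):
    """Return a rows x cols matrix of Gaussian-initialized weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class FrozenBackbone:
    """Toy stand-in for a pretrained spectral encoder; weights are never updated."""

    def __init__(self, rng):
        self.weights = make_matrix(FEAT_DIM, N_BANDS, rng, scale=0.5)

    def __call__(self, bands):
        return [math.tanh(v) for v in matvec(self.weights, bands)]


class LinearProjection:
    """The only trainable component: one linear map into the embedding space."""

    def __init__(self, rng):
        self.weights = make_matrix(EMBED_DIM, FEAT_DIM, rng, scale=0.1)

    def __call__(self, feats):
        return matvec(self.weights, feats)

    def sgd_step(self, feats, target, lr):
        """One SGD step on mean squared error; returns the loss before the update."""
        errors = [p - t for p, t in zip(self(feats), target)]
        for i, err in enumerate(errors):
            for j, f in enumerate(feats):
                self.weights[i][j] -= lr * 2.0 * err * f / EMBED_DIM
        return sum(e * e for e in errors) / EMBED_DIM


def train_projection(seed=0, n_samples=16, epochs=200, lr=0.1):
    """Train only the projection; return per-epoch losses and a frozen-check flag."""
    rng = random.Random(seed)
    backbone = FrozenBackbone(rng)
    projection = LinearProjection(rng)
    # Synthetic stand-ins: random band values, and targets produced by a hidden
    # linear map so that a linear projection can fit them.
    teacher = make_matrix(EMBED_DIM, FEAT_DIM, rng, scale=0.5)
    inputs = [[rng.random() for _ in range(N_BANDS)] for _ in range(n_samples)]
    frozen_snapshot = copy.deepcopy(backbone.weights)
    # Features are computed once, since the backbone never changes.
    feats = [backbone(x) for x in inputs]
    targets = [matvec(teacher, f) for f in feats]
    losses = []
    for _ in range(epochs):
        total = sum(projection.sgd_step(f, t, lr) for f, t in zip(feats, targets))
        losses.append(total / n_samples)
    return losses, backbone.weights == frozen_snapshot
```

Calling `train_projection()` returns the mean loss for each epoch and a flag confirming that the backbone weights were left untouched. Because the backbone is frozen, its features can be cached up front, which is part of what makes this kind of alignment stage lightweight.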

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.10144 [cs.CV]
	(or arXiv:2501.10144v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.10144

Submission history

From: Enes Karanfil
[v1] Fri, 17 Jan 2025 12:12:33 UTC (26,288 KB)
