FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing

Corley, Isaac; Nsutezo, Simone Fobi; Ortiz, Anthony; Robinson, Caleb; Dodhia, Rahul; Ferres, Juan M. Lavista; Najafirad, Peyman


arXiv:2501.08490 (cs)

[Submitted on 14 Jan 2025]




Abstract: Remote sensing imagery is dense with objects and contextual visual information. A recent trend is to pretrain encoders on paired satellite images and text captions so that they perform well on downstream tasks. However, while contrastive image-text methods like CLIP enable vision-language alignment and zero-shot classification, their vision-only downstream performance tends to degrade compared to image-only pretraining methods such as MAE. In this paper, we propose FLAVARS, a pretraining method that combines the best of both contrastive learning and masked modeling, along with geospatial alignment via contrastive location encoding. We find that FLAVARS significantly outperforms a SkyCLIP baseline on vision-only tasks such as KNN classification and semantic segmentation (e.g., +6% mIOU on SpaceNet1), while retaining the ability to perform zero-shot classification, unlike MAE-pretrained methods.
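To make the combination described in the abstract concrete, the following is a minimal sketch of how an image-text contrastive term, a masked-reconstruction term, and an image-location contrastive term could be summed into one pretraining objective. It is not the authors' implementation: the function names (`info_nce`, `masked_mse`, `combined_loss`), the symmetric InfoNCE form, the temperature of 0.07, and the equal loss weights are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; a[i] and b[i] are assumed to be a matched pair."""
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    n = len(a)
    # Cosine-similarity logits between every a[i] and every b[j].
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # The correct "class" for row i is column i (the matched pair).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
            total += log_sum_exp - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only (mask[i] == 1)."""
    n_masked = sum(mask)
    if n_masked == 0:
        return 0.0
    total = 0.0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((x - y) ** 2 for x, y in zip(p, t)) / len(p)
    return total / n_masked


def combined_loss(img, txt, loc, pred, target, mask,
                  w_text=1.0, w_mask=1.0, w_loc=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (
        w_text * info_nce(img, txt)              # image-text alignment
        + w_mask * masked_mse(pred, target, mask)  # masked patch reconstruction
        + w_loc * info_nce(img, loc)             # image-location alignment
    )
```

In this sketch, `img`, `txt`, and `loc` are batches of embedding vectors of equal length produced by hypothetical image, text, and location encoders, while `pred` and `target` are per-patch pixel vectors with `mask` marking which patches were hidden from the encoder.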

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2501.08490 [cs.CV]
	(or arXiv:2501.08490v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.08490

Submission history

From: Isaac Corley
[v1] Tue, 14 Jan 2025 23:31:20 UTC (1,884 KB)
