The in-context inductive biases of vision-language models differ across modalities

Allen, Kelsey; Dasgupta, Ishita; Kosoy, Eliza; Lampinen, Andrew K.

arXiv:2502.01530 (cs)

[Submitted on 3 Feb 2025]

Abstract: Inductive biases are what allow learners to make guesses in the absence of conclusive evidence. These biases have often been studied in cognitive science using concepts or categories, e.g. by testing how humans generalize a new category from a few examples that leave the category boundary ambiguous. We use these approaches to study generalization in foundation models during in-context learning. Modern foundation models can condition on both vision and text, and differences in how they interpret and learn from these modalities are an emerging area of study. Here, we study how their generalizations vary with the modality in which stimuli are presented and with the way the stimuli are described in text. We study these biases with three different experimental paradigms, across three different vision-language models. We find that the models generally show some bias towards generalizing according to shape over color. This shape bias tends to be amplified when the examples are presented visually. By contrast, when examples are presented in text, the ordering of adjectives affects generalization. However, the extent of these effects varies across models and paradigms. These results help to reveal how vision-language models represent different types of inputs in context, and may have practical implications for the use of vision-language models.

Comments:	10 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2502.01530 [cs.CV]
	(or arXiv:2502.01530v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.01530

Submission history

From: Andrew Lampinen
[v1] Mon, 3 Feb 2025 17:11:03 UTC (42,845 KB)
