World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Ma, Ziqiao; Pan, Jiayi; Chai, Joyce

arXiv:2306.08685 (cs)

[Submitted on 14 Jun 2023 (v1), last revised 26 Dec 2024 (this version, v2)]

Abstract: The ability to connect language units to their referents in the physical world, referred to as grounding, is crucial to learning and understanding grounded meanings of words. While humans demonstrate fast mapping in new word learning, it remains unclear whether modern vision-language models can truly represent language with their grounded meanings and how grounding may further bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning. As an initial attempt, we propose object-oriented BERT (OctoBERT), a novel visually-grounded language model by pre-training on image-text pairs highlighting grounding as an objective. Through extensive experiments and analysis, we demonstrate that OctoBERT is a more coherent and fast grounded word learner, and that the grounding ability acquired during pre-training helps the model to learn unseen words more rapidly and robustly. Our code is available at this https URL

Comments:	ACL 2023 Outstanding Paper
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.08685 [cs.CL]
	(or arXiv:2306.08685v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.08685

Submission history

From: Ziqiao Ma
[v1] Wed, 14 Jun 2023 18:10:05 UTC (1,278 KB)
[v2] Thu, 26 Dec 2024 19:50:42 UTC (1,306 KB)
