Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations

Dedhia, Bhishma; Jha, Niraj K.

arXiv:2403.07887 (cs)

[Submitted on 2 Feb 2024 (v1), last revised 20 Sep 2024 (this version, v2)]

Abstract: Several accounts of human cognition posit that our intelligence is rooted in our ability to form abstract composable concepts, ground them in our environment, and reason over these grounded entities. This trifecta of human thought has remained elusive in modern intelligent machines. In this work, we investigate whether slot representations extracted from visual scenes serve as appropriate compositional abstractions for grounding and reasoning. We present the Neural Slot Interpreter (NSI), which learns to ground object semantics in slots. At the core of NSI is an XML-like schema that uses simple syntax rules to organize the object semantics of a scene into object-centric schema primitives. Then, the NSI metric learns to ground primitives into slots through a structured objective that reasons over the intermodal alignment. We show that the grounded slots surpass unsupervised slots in real-world object discovery and scale with scene complexity. Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of correspondences learned by NSI. Finally, we investigate the reasoning abilities of the grounded slots. Vision Transformers trained on grounding-aware NSI tokenizers using as few as ten tokens outperform patch-based tokens on challenging few-shot classification tasks.
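
To make the idea of an XML-like, object-centric schema more concrete, the following is a minimal illustrative sketch only. The tag names (`scene`, `object`, `category`, `color`), the helper functions, and the toy scene are all hypothetical and are not taken from the paper; the actual NSI schema, its syntax rules, and the learned grounding metric are not reproduced here. The sketch shows one way a scene annotation could be organized into one primitive per object using Python's standard library.

```python
import xml.etree.ElementTree as ET


def scene_to_schema(objects):
    """Build an XML-like scene description with one <object> element per object.

    Tag names are hypothetical, chosen only for illustration.
    """
    scene = ET.Element("scene")
    for obj in objects:
        node = ET.SubElement(scene, "object")
        for name, value in obj.items():
            # Each property becomes a child element of its object.
            ET.SubElement(node, name).text = str(value)
    return scene


def schema_primitives(scene):
    """Split the scene into object-centric primitives (one dict per object)."""
    return [
        {child.tag: child.text for child in node}
        for node in scene.findall("object")
    ]


# Toy annotation for a two-object scene (made-up data).
objects = [
    {"category": "cube", "color": "red"},
    {"category": "sphere", "color": "blue"},
]

scene = scene_to_schema(objects)
primitives = schema_primitives(scene)
xml_text = ET.tostring(scene, encoding="unicode")
```

In this sketch, each entry of `primitives` is the kind of per-object unit that a grounding objective could try to align with a slot; how that alignment is scored and trained is specific to the paper and is not modeled here.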

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2403.07887 [cs.CV] (or arXiv:2403.07887v2 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2403.07887

Submission history

From: Bhishma Dedhia
[v1] Fri, 2 Feb 2024 12:37:23 UTC (27,851 KB)
[v2] Fri, 20 Sep 2024 17:55:41 UTC (9,679 KB)
