Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning

Zhang, Zijian; Liu, Wei

arXiv:2406.05318 (cs)

[Submitted on 8 Jun 2024]

Abstract: In this paper, we present our solution for the SMART-101 Challenge of the CVPR 2024 Multi-modal Algorithmic Reasoning Task. Unlike traditional visual question answering tasks, this challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our model is based on two pre-trained models, dedicated to extracting features from text and images respectively. To integrate the features from the two modalities, we employ a fusion layer with an attention mechanism. We explore different text and image pre-trained models and fine-tune the integrated classifier on the SMART-101 dataset. Experimental results show that, under the puzzle split data-splitting setting, our proposed integrated classifier achieves superior performance, verifying the effectiveness of multi-modal pre-trained representations.
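
The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the encoders, the attention variant, or the classifier head. The sketch assumes a single pooled text feature vector acting as the query over a handful of image patch features (scaled dot-product attention), concatenation of the text vector with the attended image vector, and a linear layer over five answer options. Random vectors stand in for the outputs of the two pre-trained models, and all names (`attention_fuse`, `classify`, `DIM`, `NUM_PATCHES`, `NUM_OPTIONS`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(text_vec, image_patches):
    """Scaled dot-product attention (assumed variant): the text vector is
    the query, the image patch features serve as keys and values."""
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, p) / scale for p in image_patches])
    attended = [
        sum(w * p[i] for w, p in zip(weights, image_patches))
        for i in range(len(text_vec))
    ]
    # Fused representation: text features concatenated with attended image features.
    return text_vec + attended, weights


def classify(fused, weight_rows, bias):
    """Linear layer followed by softmax over the answer options."""
    logits = [dot(row, fused) + b for row, b in zip(weight_rows, bias)]
    return softmax(logits)


# Hypothetical sizes; random features stand in for pre-trained encoder outputs.
rng = random.Random(0)
DIM, NUM_PATCHES, NUM_OPTIONS = 8, 4, 5
text_vec = [rng.gauss(0, 1) for _ in range(DIM)]
image_patches = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
W = [[rng.gauss(0, 0.1) for _ in range(2 * DIM)] for _ in range(NUM_OPTIONS)]
b = [0.0] * NUM_OPTIONS

fused, attn = attention_fuse(text_vec, image_patches)
probs = classify(fused, W, b)
print("attention over patches:", [round(a, 3) for a in attn])
print("answer probabilities:", [round(p, 3) for p in probs])
```

In a fine-tuning setup like the one the abstract mentions, the classifier weights (and optionally the encoders) would be trained on SMART-101; here they are left at random values purely to show the data flow.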

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.05318 [cs.CV]
	(or arXiv:2406.05318v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.05318

Submission history

From: Zijian Zhang [view email]
[v1] Sat, 8 Jun 2024 01:45:06 UTC (684 KB)
