Multimodal Analysis Of Google Bard And GPT-Vision: Experiments In Visual Reasoning

Noever, David; Noever, Samantha Elizabeth Miller

arXiv:2309.16705 (cs)

[Submitted on 17 Aug 2023 (v1), last revised 14 Oct 2023 (this version, v2)]

Abstract: To address the gap in understanding visual comprehension in Large Language Models (LLMs), we designed a challenge-response study that subjected Google Bard and GPT-Vision to 64 visual tasks spanning categories such as "Visual Situational Reasoning" and "Next Scene Prediction." Previous models, such as GPT4, leaned heavily on optical character recognition tools like Tesseract, whereas Bard and GPT-Vision, akin to Google Lens and Visual API, employ deep learning techniques for visual text recognition. However, our findings spotlight the limitations of both vision-language models: while proficient at solving visual CAPTCHAs that stump ChatGPT alone, they falter in recreating visual elements like ASCII art or analyzing Tic Tac Toe grids, suggesting an over-reliance on educated visual guesses. Prediction from visual inputs appears particularly challenging: current "next-token" multimodal models offer no common-sense guesses for next-scene forecasting. This study provides experimental insights into the current capacities of multimodal LLMs and their areas for improvement.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2309.16705 [cs.CV]
	(or arXiv:2309.16705v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.16705

Submission history

From: David Noever
[v1] Thu, 17 Aug 2023 03:14:00 UTC (1,000 KB)
[v2] Sat, 14 Oct 2023 19:53:39 UTC (1,254 KB)