FLIP Reasoning Challenge

Plesner, Andreas; Kuzhagaliyev, Turlan; Wattenhofer, Roger

arXiv:2504.12256 (cs)

[Submitted on 16 Apr 2025]

Abstract: In recent years, advances in artificial intelligence (AI) have shown that AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. Each FLIP challenge presents users with two orderings of four images and requires them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that the best open-source and closed-source models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, which yields better results than using the raw images directly: 75.2% with captions vs. 69.6% with raw images for Gemini 1.5 Pro. Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at this https URL.
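The abstract does not say how the 15 model predictions are combined. As a minimal sketch, assuming a plain majority vote over binary choices (one common way to build such an ensemble, not necessarily the authors' method), the idea can be illustrated as follows. The labels "left" and "right", the three toy models, and the four toy challenges are all hypothetical and are not taken from the FLIP dataset.

```python
from collections import Counter
from typing import Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Return the label chosen by the most models for one challenge."""
    if not votes:
        raise ValueError("need at least one vote")
    return Counter(votes).most_common(1)[0][0]


def accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of challenges where the prediction matches the truth."""
    if len(predicted) != len(truth):
        raise ValueError("length mismatch")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical toy data: each row holds one model's picks for four challenges.
# "left" / "right" name which of the two image orderings the model judged coherent.
model_predictions = [
    ["left", "right", "left", "left"],    # model A, wrong on challenge 4
    ["left", "left", "left", "right"],    # model B, wrong on challenge 2
    ["right", "right", "left", "right"],  # model C, wrong on challenge 1
]
truth = ["left", "right", "left", "right"]

# zip(*rows) regroups the picks per challenge, so each vote sees all models.
ensemble = [majority_vote(votes) for votes in zip(*model_predictions)]

for name, preds in zip("ABC", model_predictions):
    print(f"model {name}: {accuracy(preds, truth):.2f}")
print(f"ensemble: {accuracy(ensemble, truth):.2f}")
```

In this toy case each model errs on a different challenge, so the vote corrects every individual mistake. With two candidate answers and an odd number of voters (such as 15), a majority vote can never tie.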

Comments:	Published at First Workshop on Open Science for Foundation Models at ICLR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2504.12256 [cs.CV]
	(or arXiv:2504.12256v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.12256

Submission history

From: Andreas Plesner
[v1] Wed, 16 Apr 2025 17:07:16 UTC (5,018 KB)
