Perception of Visual Content: Differences Between Humans and Foundation Models

Pratama, Nardiena A.; Fan, Shaoyang; Demartini, Gianluca

arXiv:2411.18968 (cs)

[Submitted on 28 Nov 2024 (v1), last revised 26 Mar 2025 (this version, v2)]

Abstract: Human-annotated content is often used to train machine learning (ML) models. Recently, however, language and multi-modal foundation models have been used to replace and scale up human annotators' efforts. This study compares human-generated and ML-generated annotations of images representing diverse socio-economic contexts. We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels, covering a range of daily activities and home environments. We compare human and ML-generated annotations semantically and evaluate their impact on predictive models. Our results show the highest similarity between ML captions and human labels from a low-level perspective, i.e., the types of words that appear and the sentence structures, but all three annotation sets are alike in how similar or dissimilar they perceive images from different regions to be. Additionally, ML Captions resulted in the best overall region classification performance, while ML Objects and ML Captions performed best overall for income regression. The varying performance of the annotation sets highlights that all annotations are important and that human-generated annotations are not yet replaceable.

Comments:	12 pages, 5 figures, 5 tables; updated version for a Revise-and-Resubmit at ICWSM 2025. This version includes a larger and more diverse dataset, leading to updated results
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2411.18968 [cs.CV]
	(or arXiv:2411.18968v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.18968

Submission history

From: Nardiena A. Pratama
[v1] Thu, 28 Nov 2024 07:37:04 UTC (676 KB)
[v2] Wed, 26 Mar 2025 13:02:34 UTC (5,001 KB)
