Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness

Barriere, Valentin; del Rio, Felipe; De Ferari, Andres Carvallo; Aspillaga, Carlos; Herrera-Berg, Eugenio; Calderon, Cristian Buc

arXiv:2309.15991 (cs)

[Submitted on 27 Sep 2023 (v1), last revised 17 Nov 2023 (this version, v2)]

Title: Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness

Authors: Valentin Barriere, Felipe del Rio, Andres Carvallo De Ferari, Carlos Aspillaga, Eugenio Herrera-Berg, Cristian Buc Calderon

Abstract: Artificial neural networks typically struggle to generalize to out-of-context examples. One reason for this limitation is that datasets incorporate only partial information about the potential correlational structure of the world. In this work, we propose TIDA (Targeted Image-editing Data Augmentation), a targeted data augmentation method focused on improving models' human-like abilities (e.g., gender recognition) by filling the correlational structure gap using a text-to-image generative model. More specifically, TIDA identifies specific skills in captions describing images (e.g., the presence of a specific gender in the image), changes the caption (e.g., "woman" to "man"), and then uses a text-to-image model to edit the image so that it matches the new caption (e.g., changing only the woman to a man while keeping the context identical). Based on the Flickr30K benchmark, we show that, compared with the original dataset, a TIDA-enhanced dataset related to gender, color, and counting abilities induces better performance on several image captioning metrics. Furthermore, in addition to relying on the classical BLEU metric, we conduct a fine-grained analysis of the improvements of our models over the baseline in different ways. We compared text-to-image generative models and found different behaviors of the image captioning models in terms of visual encoding and textual decoding.
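The caption-side steps described in the abstract (detect a skill in a caption, then rewrite the caption) can be illustrated with a minimal sketch. This is not the authors' implementation: the word lists in `SKILL_TERMS` and the helper names `find_skills` and `edit_caption` are hypothetical, chosen only for illustration, and the image-editing step with a text-to-image model is left out entirely.

```python
import re

# Hypothetical term-to-replacement lists, for illustration only.
# The vocabularies actually used by TIDA are not given in this abstract.
SKILL_TERMS = {
    "gender": {"woman": "man", "man": "woman", "girl": "boy", "boy": "girl"},
    "color": {"red": "blue", "blue": "red", "green": "yellow", "yellow": "green"},
    "counting": {"two": "three", "three": "two"},
}


def find_skills(caption):
    """Return the sorted names of skills whose terms occur as whole words."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return sorted(
        skill
        for skill, terms in SKILL_TERMS.items()
        if any(word in terms for word in words)
    )


def edit_caption(caption, skill):
    """Swap the first term of `skill` found in the caption for its counterpart."""
    terms = SKILL_TERMS[skill]
    # Word boundaries keep "man" from matching inside "woman".
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)

    def swap(match):
        word = match.group(0)
        new = terms[word.lower()]
        # Preserve a leading capital letter, e.g. at the start of a sentence.
        return new.capitalize() if word[0].isupper() else new

    return pattern.sub(swap, caption, count=1)


caption = "A woman in a red coat walks two dogs."
for skill in find_skills(caption):
    print(skill, "->", edit_caption(caption, skill))
```

In a full pipeline, each edited caption would then be passed, together with the original image, to a text-to-image editing model so that only the targeted attribute changes; that step depends on the chosen generative model and is not sketched here.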

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2309.15991 [cs.CV]
	(or arXiv:2309.15991v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.15991

Submission history

From: Eugenio Herrera-Berg
[v1] Wed, 27 Sep 2023 20:12:41 UTC (5,323 KB)
[v2] Fri, 17 Nov 2023 15:47:35 UTC (5,325 KB)
