VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks

Han, Soyeon Caren; Long, Siqu; Luo, Siwen; Wang, Kunze; Poon, Josiah

arXiv:2010.03182v2 (cs)

[Submitted on 7 Oct 2020 (v1), revised 14 Oct 2020 (this version, v2), latest version 25 Oct 2020 (v3)]

Abstract: Text-to-image multimodal tasks, which generate or retrieve an image from a given text description, are extremely challenging because raw text descriptions carry too little information to fully describe visually realistic images. We propose VICTR, a new visual contextual text representation for text-to-image multimodal tasks that captures rich visual semantic information about objects from the text input. First, we take the text description as the initial input and apply dependency parsing to extract its syntactic structure and analyse its semantic aspects, including object quantities, in order to build a scene graph. Then, we train Graph Convolutional Networks on the objects, attributes, and relations extracted into the scene graph, together with the corresponding geometric relation information, to generate a text representation that integrates textual and visual semantic information. This text representation is aggregated with word-level and sentence-level embeddings to produce both visual contextual word and sentence representations. For evaluation, we attached VICTR to state-of-the-art models in text-to-image generation. VICTR is easily added to existing models and improves on them in both quantitative and qualitative aspects.
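
The pipeline described in the abstract (scene graph, graph convolution, aggregation with text embeddings) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the node names, the one-hot features, the identity weight matrix, the mean-aggregation rule, and the use of concatenation as the aggregation step are hypothetical and are not taken from the paper, whose actual architecture and training setup are not specified in the abstract.

```python
# Toy sketch: one graph-convolution step over a hand-written scene graph.
# All names, edges, and vectors are hypothetical, not taken from the paper.

def build_adjacency(nodes, edges):
    """Return a symmetric adjacency matrix with self-loops."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        adj[index[a]][index[b]] = 1.0
        adj[index[b]][index[a]] = 1.0
    return adj

def gcn_layer(adj, feats, weight):
    """Compute ReLU(D^-1 A H W): average neighbours, project, clip at zero."""
    dim_in, dim_out = len(feats[0]), len(weight[0])
    out = []
    for row in adj:
        deg = sum(row)
        agg = [sum(row[j] * feats[j][k] for j in range(len(feats))) / deg
               for k in range(dim_in)]
        out.append([max(0.0, sum(agg[k] * weight[k][m] for k in range(dim_in)))
                    for m in range(dim_out)])
    return out

# Scene graph for "a man riding a brown horse": objects, a relation, an attribute.
nodes = ["man", "riding", "horse", "brown"]
edges = [("man", "riding"), ("riding", "horse"), ("horse", "brown")]

adj = build_adjacency(nodes, edges)
feats = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # one-hot
weight = [row[:] for row in feats]  # identity stands in for learned weights

graph_vecs = gcn_layer(adj, feats, weight)

# Aggregate with a (made-up) word embedding by concatenation.
word_vec_man = [0.2, 0.7]
fused_man = word_vec_man + graph_vecs[nodes.index("man")]
print(fused_man)
```

With identity weights, each node's output is simply the average of its own and its neighbours' one-hot features, which makes it easy to see how relation and attribute nodes flow into an object's representation before it is joined with the word embedding.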

Comments:	Accepted by COLING 2020
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2010.03182 [cs.CV]
	(or arXiv:2010.03182v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2010.03182

Submission history

From: Siqu Long
[v1] Wed, 7 Oct 2020 05:25:30 UTC (6,323 KB)
[v2] Wed, 14 Oct 2020 12:20:43 UTC (6,323 KB)
[v3] Sun, 25 Oct 2020 05:21:52 UTC (6,323 KB)
