Generating Multimodal Images with GAN: Integrating Text, Image, and Style

Tan, Chaoyi; Zhang, Wenqing; Qi, Zhen; Shih, Kowei; Li, Xinshi; Xiang, Ao

arXiv:2501.02167 (cs)

[Submitted on 4 Jan 2025]

Abstract: In the field of computer vision, multimodal image generation has become a research hotspot, especially the task of integrating text, image, and style. In this study, we propose a multimodal image generation method based on Generative Adversarial Networks (GAN), capable of effectively combining text descriptions, reference images, and style information to generate images that meet multimodal requirements. This method involves the design of a text encoder, an image feature extractor, and a style integration module, ensuring that the generated images maintain high quality in terms of visual content and style consistency. We also introduce multiple loss functions, including adversarial loss, text-image consistency loss, and style matching loss, to optimize the generation process. Experimental results show that our method produces images with high clarity and consistency across multiple public datasets, demonstrating significant performance improvements compared to existing methods. The outcomes of this study provide new insights into multimodal image generation and present broad application prospects.
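
The abstract names three loss terms (adversarial, text-image consistency, style matching) but does not give their form. As a rough illustration only, the sketch below shows one common way such terms could be combined into a single generator objective. Everything in it is an assumption rather than the authors' formulation: the non-saturating adversarial term, the cosine-distance consistency term, the Gram-matrix style term, and the weights `lambda_text` and `lambda_style` are all hypothetical choices.

```python
import math

# Illustrative sketch only: the loss forms and weights below are assumptions,
# not the formulation used in the paper.

def adversarial_loss(d_fake):
    """Non-saturating generator loss: -mean(log D(G(.))) over discriminator scores in (0, 1]."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def text_image_consistency_loss(text_emb, image_emb):
    """Assumed form: 1 - cosine similarity between text and image embeddings."""
    return 1.0 - cosine_similarity(text_emb, image_emb)

def gram_matrix(features):
    """features: list of C channels, each a flat list of N activations."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]

def style_matching_loss(gen_features, style_features):
    """Assumed form: mean squared difference between Gram matrices."""
    g = gram_matrix(gen_features)
    s = gram_matrix(style_features)
    c = len(g)
    return sum((g[i][j] - s[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)

def generator_loss(d_fake, text_emb, image_emb, gen_features, style_features,
                   lambda_text=1.0, lambda_style=1.0):
    """Weighted sum of the three terms; the weights are hypothetical hyperparameters."""
    return (adversarial_loss(d_fake)
            + lambda_text * text_image_consistency_loss(text_emb, image_emb)
            + lambda_style * style_matching_loss(gen_features, style_features))
```

In a setup like this, the two weights would trade off realism against agreement with the text and with the reference style; the values actually used in the paper are not stated in the abstract.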

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.02167 [cs.CV]
	(or arXiv:2501.02167v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.02167

Submission history

From: Ao Xiang
[v1] Sat, 4 Jan 2025 02:51:28 UTC (1,388 KB)
