X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation

Ma, Yiwei; Fan, Yijun; Ji, Jiayi; Wang, Haowei; Sun, Xiaoshuai; Jiang, Guannan; Shu, Annan; Ji, Rongrong

arXiv:2312.00085 (cs)

[Submitted on 30 Nov 2023 (v1), last revised 30 Jul 2024 (this version, v3)]

Abstract: In recent times, automatic text-to-3D content creation has made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation to ensure that the rendered image aligns well with the given text, as evaluated by the pretrained 2D diffusion model. Nevertheless, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects. Consequently, employing 2D diffusion models directly for optimizing 3D representations may lead to suboptimal outcomes. To address this issue, we present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. The key components of X-Dreamer are two innovative designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion models by employing camera-dependent generation for trainable parameters. This integration enhances the alignment between the generated 3D assets and the camera's perspective. AMA loss guides the attention map of the pretrained diffusion model using the binary mask of the 3D object, prioritizing the creation of the foreground object. This module ensures that the model focuses on generating accurate and detailed foreground objects. Extensive evaluations demonstrate the effectiveness of our proposed method compared to existing text-to-3D approaches. Our project webpage: this https URL.
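
The abstract names the two components but gives no formulas, so the following is only a toy sketch of the general ideas, not the authors' implementation. It assumes (hypothetically) that the camera-dependent parameters are the LoRA down-projection, produced from a camera vector by a single linear map, and that the attention map is compared to the binary mask with a mean squared error after min-max normalisation. All names (`camera_guided_lora`, `ama_loss`, `proj_a`, `lora_b`) are illustrative and do not come from the paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def camera_guided_lora(x, w_frozen, camera, proj_a, lora_b):
    """Frozen linear layer plus a low-rank update that depends on the camera.

    Assumption (not from the paper): the down-projection A is generated from
    the camera vector by one linear map `proj_a`; `lora_b` is an ordinary
    trainable up-projection.
    Shapes: w_frozen (d_out x d_in), proj_a (cam_dim x rank*d_in),
            lora_b (d_out x rank), x (d_in), camera (cam_dim).
    """
    rank, d_in = len(lora_b[0]), len(w_frozen[0])
    # Generate the flattened (rank x d_in) matrix A from the camera parameters.
    flat_a = matmul([camera], proj_a)[0]
    lora_a = [flat_a[r * d_in:(r + 1) * d_in] for r in range(rank)]
    # delta_w = B @ A has rank at most `rank`; the frozen weight is untouched.
    delta_w = matmul(lora_b, lora_a)
    w_eff = [[w + d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(w_frozen, delta_w)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_eff]


def ama_loss(attention, mask):
    """Mean squared error between a min-max normalised attention map and a
    binary foreground mask (the choice of distance is an assumption)."""
    flat_att = [a for row in attention for a in row]
    flat_mask = [m for row in mask for m in row]
    lo, hi = min(flat_att), max(flat_att)
    scale = (hi - lo) or 1.0  # avoid division by zero on a constant map
    norm = [(a - lo) / scale for a in flat_att]
    return sum((a - m) ** 2 for a, m in zip(norm, flat_mask)) / len(norm)


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]       # frozen 2x2 weight
    proj = [[1.0, 0.0], [0.0, 1.0]]    # camera -> flattened A (rank 1, d_in 2)
    b = [[1.0], [0.0]]                 # up-projection B (d_out 2, rank 1)
    print(camera_guided_lora([1.0, 2.0], w, [0.5, -0.5], proj, b))  # [0.5, 2.0]
    print(ama_loss([[0.0, 2.0], [0.0, 2.0]], [[0, 1], [0, 1]]))     # 0.0
```

With `lora_b` set to zeros the layer reproduces the frozen output exactly, which mirrors the usual LoRA initialisation. In this sketch, changing `camera` changes only the low-rank update and leaves the frozen weight untouched.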

Comments:	ToMM24
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.00085 [cs.CV]
	(or arXiv:2312.00085v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.00085

Submission history

From: Yiwei Ma
[v1] Thu, 30 Nov 2023 07:23:00 UTC (12,324 KB)
[v2] Mon, 25 Dec 2023 05:46:18 UTC (12,323 KB)
[v3] Tue, 30 Jul 2024 06:17:08 UTC (16,932 KB)
