Taming Feed-forward Reconstruction Models as Latent Encoders for 3D Generative Models

Wizadwongsa, Suttisak; Zhou, Jinfan; Li, Edward; Park, Jeong Joon

Computer Science > Computer Vision and Pattern Recognition

arXiv:2501.00651 (cs)

[Submitted on 31 Dec 2024 (v1), last revised 4 Jan 2025 (this version, v2)]

Abstract: Recent AI-based 3D content creation has largely evolved along two paths: feed-forward image-to-3D reconstruction approaches and 3D generative models trained with 2D or 3D supervision. In this work, we show that existing feed-forward reconstruction methods can serve as effective latent encoders for training 3D generative models, thereby bridging these two paradigms. By reusing powerful pre-trained reconstruction models, we avoid computationally expensive encoder network training and obtain rich 3D latent features for generative modeling for free. However, the latent spaces of reconstruction models are not well-suited for generative modeling due to their unstructured nature. To enable flow-based model training on these latent features, we develop post-processing pipelines, including protocols to standardize the features and spatial weighting to concentrate on important regions. We further incorporate a 2D image space perceptual rendering loss to handle the high-dimensional latent spaces. Finally, we propose a multi-stream transformer-based rectified flow architecture to achieve linear scaling and high-quality text-conditioned 3D generation. Our framework leverages the advancements of feed-forward reconstruction models to enhance the scalability of 3D generative modeling, achieving both high computational efficiency and state-of-the-art performance in text-to-3D generation.
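
The abstract names two post-processing ideas (feature standardization and spatial weighting) and a rectified flow objective, but gives no formulas. The following is a minimal sketch of how such pieces could fit together, not the authors' implementation. Everything in it is an assumption made for illustration: the latent layout `(N, C, H, W)`, the per-channel statistics, the normalized weighted mean-squared error, and the names `standardize`, `weighted_rectified_flow_loss`, and `velocity_fn`. The interpolation and velocity target follow the standard rectified flow formulation.

```python
import numpy as np

def standardize(latents, eps=1e-6):
    """Per-channel standardization (an assumed protocol, not the paper's).

    latents: array of shape (N, C, H, W).
    Returns the standardized array and the statistics needed to undo it.
    """
    mean = latents.mean(axis=(0, 2, 3), keepdims=True)
    std = latents.std(axis=(0, 2, 3), keepdims=True) + eps
    return (latents - mean) / std, mean, std

def weighted_rectified_flow_loss(velocity_fn, x1, weights, rng):
    """Spatially weighted rectified flow loss for one batch.

    velocity_fn(x_t, t): hypothetical model returning an array shaped like x_t.
    x1: standardized data latents, shape (N, C, H, W).
    weights: non-negative spatial weights, shape (N, 1, H, W), with a
             positive total (assumed; larger values mark important regions).
    rng: a numpy.random.Generator.
    """
    n = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise endpoint
    t = rng.uniform(size=(n, 1, 1, 1))   # one time value per example
    x_t = (1.0 - t) * x0 + t * x1        # straight-line interpolation
    target = x1 - x0                     # constant velocity along that line
    err = (velocity_fn(x_t, t) - target) ** 2
    w = np.broadcast_to(weights, err.shape)
    return float((w * err).sum() / w.sum())
```

Under these assumptions, samples produced by a trained flow would be mapped back to the encoder's latent scale with `z * std + mean` before decoding. With uniform weights the loss reduces to the ordinary mean-squared velocity error, so the weighting only changes which regions dominate the gradient.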

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2501.00651 [cs.CV]
	(or arXiv:2501.00651v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.00651

Submission history

From: Jinfan Zhou
[v1] Tue, 31 Dec 2024 21:23:08 UTC (38,797 KB)
[v2] Sat, 4 Jan 2025 08:27:57 UTC (38,797 KB)
