2.2. 3D Generative Models
Instead of performing a time-consuming per-shape optimization guided by 2D diffusion models, some works attempt to directly train 3D diffusion models based on various 3D representations, such as point clouds [37, 41, 71, 75], meshes [16, 34], and neural fields [1, 4, 7, 14, 17, 21, 25–27, 40, 42, 61, 72]. However, due to the limited size of publicly available 3D asset datasets, most of these works have only been validated on a limited number of shape categories, and how to scale them up to large datasets remains an open problem. In contrast, our method adopts 2D representations and can therefore be built upon 2D diffusion models [47], whose pretrained priors significantly improve zero-shot generalization.
2.3. Multi-view Diffusion Models
To generate consistent multi-view images, some efforts [3, 10, 18, 28, 32, 53, 55, 56, 58, 64, 66, 68, 70, 76] extend 2D diffusion models from single-view images to multi-view images. However, most of these methods focus on image generation and are not designed for 3D reconstruction. The works [66, 73] first warp images using estimated depth maps to produce incomplete novel-view images and then inpaint them, but their result quality degrades significantly when the depth maps estimated by external depth estimation models are inaccurate. The recent works Viewset Diffusion [53], SyncDreamer [33], and MVDream [51] share a similar idea: they produce consistent multi-view color images via attention layers. However, unlike normal maps, which explicitly encode geometric information, color images suffer from texture ambiguity during reconstruction, so these methods either struggle to recover geometric details or incur huge computational costs. SyncDreamer [33] requires dense views for 3D reconstruction, yet still suffers from low-quality geometry and blurry textures. MVDream [51] still resorts to a time-consuming optimization using an SDS loss for 3D reconstruction, and its multi-view distillation scheme requires 1.5 hours. In contrast, our method can reconstruct high-quality textured meshes in just 2 minutes.
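To illustrate why warp-then-inpaint pipelines depend so heavily on depth quality, the following is a minimal NumPy sketch of depth-based forward warping. It is not the implementation used in [66, 73]; the function name `forward_warp`, the shared pinhole intrinsics `K`, and the relative pose `(R, t)` are assumptions made for illustration only. The returned hole mask marks the pixels an inpainting model would have to fill, and any error in `depth` directly shifts where source pixels land.

```python
import numpy as np

def forward_warp(image, depth, K, R, t):
    """Warp `image` (H, W, C) into a target view using per-pixel `depth` (H, W).

    Hypothetical helper: K is a 3x3 pinhole intrinsic matrix shared by both
    views, and (R, t) maps source-camera coordinates to target-camera
    coordinates. Returns the warped image and a boolean hole mask.
    """
    H, W = depth.shape
    flat_img = image.reshape(H * W, -1)

    # Homogeneous pixel grid, shape (3, H*W), in row-major order.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)

    # Back-project to 3D in the source frame, then move to the target frame.
    pts_src = (np.linalg.inv(K) @ pix) * depth.ravel()
    pts_tgt = R @ pts_src + np.asarray(t, dtype=float).reshape(3, 1)

    # Keep points in front of the target camera and project them.
    z = pts_tgt[2]
    in_front = z > 1e-6
    proj = K @ pts_tgt[:, in_front]
    u_t = np.round(proj[0] / proj[2]).astype(int)
    v_t = np.round(proj[1] / proj[2]).astype(int)
    inside = (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    src_idx = np.flatnonzero(in_front)[inside]
    tgt_idx = v_t[inside] * W + u_t[inside]

    # Z-buffer: sort near to far, then keep the first (nearest) source pixel
    # for every target pixel.
    order = np.argsort(z[src_idx], kind="stable")
    uniq_tgt, first = np.unique(tgt_idx[order], return_index=True)
    winners = src_idx[order][first]

    warped = np.zeros_like(flat_img)
    warped[uniq_tgt] = flat_img[winners]
    holes = np.ones(H * W, dtype=bool)
    holes[uniq_tgt] = False
    return warped.reshape(image.shape), holes.reshape(H, W)
```

With an identity pose the image is reproduced without holes; with a sideways translation, a band of target pixels receives no source pixel and must be hallucinated by the inpainting stage.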
Authors:
(1) Xiaoxiao Long, The University of Hong Kong, VAST, MPI Informatik (equal contribution);
(2) Yuan-Chen Guo, Tsinghua University, VAST (equal contribution);
(3) Cheng Lin, The University of Hong Kong (corresponding author);
(4) Yuan Liu, The University of Hong Kong;
(5) Zhiyang Dou, The University of Hong Kong;
(6) Lingjie Liu, University of Pennsylvania;
(7) Yuexin Ma, ShanghaiTech University;
(8) Song-Hai Zhang, The University of Hong Kong;
(9) Marc Habermann, MPI Informatik;
(10) Christian Theobalt, MPI Informatik;
(11) Wenping Wang, Texas A&M University (corresponding author).