Table of Links
Abstract and 1 Introduction
2. Related Work
3. Method and 3.1. Architecture
3.2. Loss and 3.3. Implementation Details
4. Data Curation
4.1. Training Dataset
4.2. Evaluation Benchmark
5. Experiments and 5.1. Metrics
5.2. Baselines
5.3. Comparison to SOTA Methods
5.4. Qualitative Results and 5.5. Ablation Study
6. Limitations and Discussion
7. Conclusion and References
A. Additional Qualitative Comparison
B. Inference on AI-generated Images
C. Data Curation Details
Estimating the 3D shape of an object from a single image is a complex inverse problem: while the shape of the visible surface can be estimated from shading, estimating the shape of the occluded portion requires prior knowledge about object geometry. This is one of the marvels of human perception, and achieving it computationally is a major goal for our field. We review regression-based and generative methods for this task.

Regression-based Methods. These works investigate different ways to represent 3D object shapes and the architectures to produce them from a single image, e.g., meshes [23, 58, 62] or implicit representations like discrete [8, 54] or continuous [35, 39] occupancy, signed distance fields [21, 56, 68], point clouds [2, 14], or sets of parametric surfaces [15, 69]. A major limitation of these works is their limited generalization beyond the categories of the training set.
Two developments improved zero-shot generalization: decomposing the problem into first predicting depth and then estimating the complete shape [50, 56, 64, 67, 70], and representing 3D in a viewer-centered rather than object-centered reference frame [50, 56, 70]. Most architectures follow an encoder/decoder design, where the encoder produces a feature map from which the decoder predicts the 3D shape.
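To make the decomposition concrete, the sketch below lifts a predicted depth map to a viewer-centered partial point cloud using camera intrinsics. This is a minimal illustration under our own assumptions (the function name, tensor shapes, and intrinsics matrix K are illustrative), not any paper's implementation:

```python
import torch

def unproject_depth(depth, K):
    """Hypothetical helper: lift a depth map to a viewer-centered point
    cloud using camera intrinsics K. The resulting partial surface is
    what later stages complete into a full 3D shape."""
    B, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T       # back-project pixels to camera rays
    return rays.unsqueeze(0) * depth.unsqueeze(-1)  # (B, H, W, 3) 3D points
```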
While early works produced a single feature vector for the entire image, it was later shown that using local features from a 2D feature map improves both the detail of the predicted shapes [58, 68] and generalization to unseen categories [5, 67]. This line of work culminated in the current state-of-the-art regression method, MCC [63], which takes an RGB-D image as input and uses a transformer-based encoder-decoder to produce a “shell occupancy” prediction [4]. Our approach incorporates these prior findings for improved generalization and builds on them with a new module that estimates the visible shape of the object from predicted depth and camera intrinsics; its output is processed by a cross-attention-based decoder to produce an occupancy prediction.
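For intuition, here is a minimal sketch of a cross-attention occupancy decoder in this spirit: 3D query points attend to encoder feature tokens and each point receives an occupancy logit. The module structure, dimensions, and names are our own illustrative assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class CrossAttnOccupancyDecoder(nn.Module):
    """Sketch: query 3D points against encoder feature tokens with
    cross-attention and predict per-point occupancy logits."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.point_embed = nn.Linear(3, dim)               # embed xyz queries
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, points, tokens):
        # points: (B, N, 3) query locations; tokens: (B, T, dim) image features
        q = self.point_embed(points)
        attended, _ = self.attn(q, tokens, tokens)         # cross-attention
        return self.head(attended).squeeze(-1)             # (B, N) logits
```

Querying arbitrary 3D locations this way lets the same decoder evaluate occupancy at any resolution without retraining.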
3D Generative Methods. This category of methods performs zero-shot 3D shape reconstruction using a learned 3D generative prior, where generation is conditioned on one or a few input images. Given image or text conditioning, early work [65] used GANs to generate voxels, whereas more recent works use diffusion models to generate point clouds [37] or function parameters for implicit 3D representations [22]. Another related generative framing is conditional view synthesis.
Works in this direction fine-tune 2D generative models [31], or train them from scratch [60, 71], to synthesize novel views conditioned on a single image and a viewing angle. This yields an implicit 3D prior, from which a 3D shape can be extracted by fitting a 3D neural representation to the synthesized views, or by predicting its parameters directly [30].
3D from 2D Generative Models. There have been efforts to use the real-world 2D image prior of text-to-image generative models [1, 44, 48, 49] to reconstruct 3D shape from a single image. These works [11, 34, 53] often rely on techniques such as the SDS loss [40, 57] and generate 3D assets by optimizing for each object separately. The prolonged optimization time prevents them from being evaluated at scale or applied in many real-world applications. Orthogonal to these optimization-based approaches, we focus on learning a 3D shape prior that generalizes across instances, and we perform no per-instance optimization at test time.
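As a rough illustration of why such optimization is slow, below is a hedged sketch of a single SDS-style update in the spirit of DreamFusion [40]; render_fn, unet, and the weighting are hypothetical stand-ins for a differentiable renderer and a frozen diffusion prior, and thousands of such steps are typically needed per object:

```python
import torch

def sds_loss(render_fn, unet, alphas_cumprod, text_emb):
    """One Score Distillation Sampling step (sketch): nudge a
    differentiable 3D representation so its renders score well under a
    frozen text-conditioned diffusion model. All callables here are
    hypothetical stand-ins, not a specific library's API."""
    img = render_fn()                                  # (B, 3, H, W), needs grad
    t = torch.randint(20, 980, (1,), device=img.device)
    eps = torch.randn_like(img)
    a_t = alphas_cumprod[t].view(1, 1, 1, 1)
    x_t = a_t.sqrt() * img + (1.0 - a_t).sqrt() * eps  # diffuse the render
    with torch.no_grad():                              # the prior stays frozen
        eps_pred = unet(x_t, t, text_emb)
    grad = (1.0 - a_t) * (eps_pred - eps)              # w(t) = 1 - alpha_bar_t
    # Reparameterize so that d(loss)/d(img) equals grad
    return (grad.detach() * img).sum()
```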
[4] Traditionally, occupancy is formulated as predicting whether a point in 3D is inside or outside a watertight mesh, whereas MCC predicts whether it lies within an ε-wide shell representing the surface of the object.
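The distinction can be made concrete with a small sketch, where inside_fn and signed_distance_fn are hypothetical oracles for a given watertight mesh:

```python
import numpy as np

def traditional_occupancy(points, inside_fn):
    """1 if a query point is inside the watertight mesh, else 0."""
    return inside_fn(points).astype(np.float32)

def shell_occupancy(points, signed_distance_fn, eps=0.05):
    """MCC-style label: 1 if the point lies within an eps-wide shell
    around the object's surface, else 0."""
    return (np.abs(signed_distance_fn(points)) < eps).astype(np.float32)
```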
Authors:
(1) Zixuan Huang, University of Illinois at Urbana-Champaign and both authors contributed equally to this work;
(2) Stefan Stojanov, Georgia Institute of Technology and both authors contributed equally to this work;
(3) Anh Thai, Georgia Institute of Technology;
(4) Varun Jampani, Stability AI;
(5) James M. Rehg, University of Illinois at Urbana-Champaign.