Table of Links
Abstract and 1 Introduction
2. Related Work
3. Method and 3.1. Architecture
3.2. Loss and 3.3. Implementation Details
4. Data Curation
4.1. Training Dataset
4.2. Evaluation Benchmark
5. Experiments and 5.1. Metrics
5.2. Baselines
5.3. Comparison to SOTA Methods
5.4. Qualitative Results and 5.5. Ablation Study
6. Limitations and Discussion
7. Conclusion and References
A. Additional Qualitative Comparison
B. Inference on AI-generated Images
C. Data Curation Details
Abstract
We study the problem of single-image zero-shot 3D shape reconstruction. Recent works learn zero-shot shape reconstruction through generative modeling of 3D assets, but these models are computationally expensive at training and inference time. In contrast, the traditional approach to this problem is regression-based, where deterministic models are trained to directly regress the object shape. Such regression methods are far more computationally efficient than generative methods. This raises a natural question: is generative modeling necessary for high performance, or conversely, are regression-based approaches still competitive?
To answer this, we design a strong regression-based model, called ZeroShape, based on converging findings in this field and a novel insight. We also curate a large real-world evaluation benchmark, with objects from three different real-world 3D datasets. This benchmark is more diverse and an order of magnitude larger than what prior works use to quantitatively evaluate their models, with the aim of reducing evaluation variance in the field. We show that ZeroShape not only achieves superior performance over state-of-the-art methods, but also demonstrates significantly higher computational and data efficiency.[1]
1. Introduction
Inferring the properties of individual objects, such as their category or 3D shape, is a fundamental task in computer vision. The ultimate goal is to do this accurately for any object, generally referred to as zero-shot generalization. For machine learning methods, this means high accuracy on data distributions that may differ significantly from the training set, such as images of novel types of objects like machine parts, or images from uncommon visual contexts like underwater imagery. An object representation capable of zero-shot generalization therefore needs to accurately capture the visual properties that are shared across all objects in the world—an extremely ambitious goal.
Recent work in computer vision has taken the broader challenge of zero-shot generalization head-on, with impressive developments for 2D vision tasks like segmentation [26, 42], visual question answering [3, 27], image generation [1, 48, 49], and in training general vision representations that can be easily adapted for any vision task [38, 43]. This progress has largely been enabled by increasing model size and scaling training dataset size to the order of tens to hundreds of millions of images.
These developments have inspired efforts aimed at zero-shot generalization for single-image 3D object shape reconstruction [22, 30, 31, 37]. This is a classical and famously ill-posed problem, with important applications such as virtual object placement in AR scenes and object manipulation in robotics.
These works aim to learn a “zero-shot 3D shape prior” by relying on generative diffusion models for 3D point clouds [37], NeRFs [22], or 2D images fine-tuned for novel-view synthesis [30, 31], enabled by million-scale 3D data curation efforts such as Objaverse [9, 10]. While these methods show impressive zero-shot generalization, this ability comes at great computational cost due to large model parameter counts and the inference-time sampling required by diffusion models.
Using expensive generative modeling for zero-shot 3D shape from single images diverges from the approach of early deep learning-based works on this task [50, 56, 63, 67, 70]. These works define the task as a 3D occupancy or signed distance regression problem and predict the shape of objects in a single forward pass. This raises a natural question: is generative modeling necessary for high performance at learning a zero-shot 3D shape prior, or conversely, can a regression-based approach still be competitive?
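To make the regression formulation concrete, the following is a minimal PyTorch sketch of such a single-forward-pass occupancy decoder. It is our illustration, not code from any of the cited works; the image encoder, feature dimensions, and point-sampling strategy are placeholder assumptions.

```python
# A minimal sketch (ours, not from the cited works) of single-pass
# occupancy regression: a decoder maps an image feature plus a 3D query
# point to an inside/outside logit, so shape is read out by querying a grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccupancyDecoder(nn.Module):
    def __init__(self, feat_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # occupancy logit per query point
        )

    def forward(self, img_feat: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, feat_dim) global feature from any image encoder (assumed)
        # points:   (B, N, 3) 3D query coordinates
        feat = img_feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.mlp(torch.cat([feat, points], dim=-1)).squeeze(-1)

# Training is plain regression with binary cross-entropy; no sampling loop:
# loss = F.binary_cross_entropy_with_logits(decoder(feat, pts), occ_labels)
```

The key property is that inference is a single deterministic forward pass over a batch of query points, with none of the iterative sampling that diffusion-based methods require.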
In this work, we find that regression approaches are indeed competitive if designed carefully, and computationally more efficient by a large margin than their generative counterparts. We propose ZeroShape: a regression-based 3D shape reconstruction approach that achieves state-of-the-art zero-shot generalization, trained entirely on synthetic data, while requiring a fraction of the compute and data budget of prior work (see Fig. 1). We build our model upon key ingredients that prior works have found to facilitate generalization: 1) the use of intermediate geometric representations (e.g., depth) [33, 56, 64, 67, 70], and 2) explicit reasoning with local features [5, 63, 68].
Specifically, we decompose reconstruction into first estimating the geometry of the visible portion of the object, and then predicting the complete 3D shape from this initial estimate. Accurate estimation of the visible 3D surface is enabled by jointly modeling camera intrinsics and depth, which we find to be essential for high accuracy.
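The visible-surface estimate in this decomposition amounts to lifting a predicted depth map into camera space using the estimated intrinsics. Below is a minimal sketch of that unprojection step under a standard pinhole model; the function name and intrinsics parameterization (fx, fy, cx, cy) are our assumptions, not the paper's exact implementation.

```python
# A minimal sketch of lifting a predicted depth map to the visible 3D
# surface under a standard pinhole model. The parameterization (fx, fy,
# cx, cy) is our assumption; the paper's exact formulation may differ.
import torch

def unproject_depth(depth: torch.Tensor, fx: float, fy: float,
                    cx: float, cy: float) -> torch.Tensor:
    # depth: (H, W) predicted depth map; returns (H*W, 3) camera-space points
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    z = depth
    x = (u - cx) / fx * z  # pinhole: X = (u - cx) * Z / fx
    y = (v - cy) / fy * z
    return torch.stack([x, y, z], dim=-1).reshape(-1, 3)
```

Note that errors in the estimated focal lengths scale the lifted x and y coordinates relative to depth, distorting the visible surface, which illustrates why jointly modeling intrinsics and depth matters for accuracy.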
Another thrust of our work is a large benchmark for evaluating zero-shot reconstruction performance. The 3D vision community is working toward a zero-shot 3D shape prior, but what is the correct way to evaluate our progress? Currently we lack a well-defined benchmark, which has led to well-curated qualitative results and small-scale quantitative results[3] on different datasets across different papers. This makes it difficult to track progress and identify directions for future research.
To resolve this and standardize evaluation, we develop a protocol based on data generated from existing datasets of 3D object assets. Our benchmark includes thousands of common objects from hundreds of categories and multiple data sources. We consider real images paired with 3D meshes [51, 52], and also generate photorealistic renders of 3D object scans [66]. Our large-scale quantitative evaluation provides a rigorous perspective on the current state of the art.
In summary, our contributions are:
• ZeroShape: A regression-based zero-shot 3D shape reconstruction method with state-of-the-art performance at a fraction of the compute and data budget of prior work.
• A unified large-scale evaluation benchmark for zero-shot 3D shape reconstruction, generated by standardized processing and rendering of existing 3D datasets.
[1] Project website at: https://zixuanh.com/projects/zeroshape.html
[2] We use 3M as a reference value. Point-E [37] and Shap-E [22] state a dataset size of “several million”.
[3] On the order of hundreds of objects from tens of categories at best, to just a few dozen objects at worst.
Authors:
(1) Zixuan Huang, University of Illinois at Urbana-Champaign (equal contribution);
(2) Stefan Stojanov, Georgia Institute of Technology (equal contribution);
(3) Anh Thai, Georgia Institute of Technology;
(4) Varun Jampani, Stability AI;
(5) James M. Rehg, University of Illinois at Urbana-Champaign.