Table of Links
Abstract and 1 Introduction
-
Related Works
2.1. Vision-and-Language Navigation
2.2. Semantic Scene Understanding and Instance Segmentation
2.3. 3D Scene Reconstruction
-
Methodology
3.1. Data Collection
3.2. Open-set Semantic Information from Images
3.3. Creating the Open-set 3D Representation
3.4. Language-Guided Navigation
-
Experiments
4.1. Quantitative Evaluation
4.2. Qualitative Results
-
Conclusion and Future Work, Disclosure statement, and References
2.3. 3D Scene Reconstruction
3D scene reconstruction has advanced significantly in recent years. One line of work takes a self-supervised approach to semantic geometry completion and appearance reconstruction from RGB-D scans, such as [26], which uses a 3D encoder-decoder architecture for both geometry and colour; the focus of these approaches is generating semantic reconstructions without ground truth. Another direction integrates real-time 3D reconstruction with SLAM through keyframe-based techniques, and has been applied in recent autonomous navigation and AR use cases [27]. A further line of work applies Neural Radiance Fields (NeRF) [28] to indoor spaces, using structure-from-motion to understand camera-captured scenes; these NeRF models are trained per location and are particularly strong at spatial understanding. Finally, some methods build 3D scene graphs using open-vocabulary foundation models such as CLIP to capture semantic relationships between objects and their visual representations [4]; during reconstruction, features extracted from the 3D point clouds are projected onto the embedding space learned by CLIP.
This work uses the open-set 2D instance segmentation method described in the previous sections. Given an RGB-D image, we obtain individual object masks from the RGB image and back-project them to 3D using the depth image. This yields an instance-based approach rather than the point-by-point computation used previously by ConceptFusion [29]. Extracting a feature per object mask also lets us compute per-instance embeddings, which preserve the open-set nature of this pipeline.
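The back-projection step described above can be sketched as follows. This is a minimal illustration assuming a standard pinhole camera model with known intrinsics; the function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def backproject_mask(depth, mask, K):
    """Back-project a 2D instance mask to a 3D point cloud (camera frame).

    depth: (H, W) depth map in metres
    mask:  (H, W) boolean instance mask from the 2D segmenter
    K:     (3, 3) pinhole camera intrinsics
    """
    fx, fy = K[0, 0], K[1, 1]          # focal lengths in pixels
    cx, cy = K[0, 2], K[1, 2]          # principal point
    v, u = np.nonzero(mask)            # pixel coordinates inside the mask
    z = depth[v, u]
    valid = z > 0                      # drop pixels with missing depth
    u, v, z = u[valid], v[valid], z[valid]
    # Invert the pinhole projection: x = (u - cx) * z / fx, etc.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # (N, 3) points for this instance
```

Running this once per object mask, instead of once per pixel over the whole frame, is what makes the representation instance-based: each returned point set corresponds to one object and can be paired with that object's embedding.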
:::info
Authors:
(1) Laksh Nanwani, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;
(2) Kumaraditya Gupta, International Institute of Information Technology, Hyderabad, India;
(3) Aditya Mathur, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;
(4) Swayam Agrawal, International Institute of Information Technology, Hyderabad, India;
(5) A.H. Abdul Hafez, Hasan Kalyoncu University, Sahinbey, Gaziantep, Turkey;
(6) K. Madhava Krishna, International Institute of Information Technology, Hyderabad, India.
:::
:::info
This paper is available on arXiv under the CC BY-SA 4.0 DEED (Attribution-ShareAlike 4.0 International) license.
:::
