This AI Mapping System Lets Robots See the World | HackerNoon

News Room | Published 10 November 2025
:::info
Authors:

(1) Jiacui Huang, Senior, IEEE;

(2) Hongtao Zhang, Senior, IEEE;

(3) Mingbo Zhao, Senior, IEEE;

(4) Wu Zhou, Senior, IEEE.

:::

Table of Links

Abstract and I. Introduction

II. Related Work

III. Method

IV. Experiment

V. Conclusion, Acknowledgements, and References

VI. Appendix

Abstract—Vision-and-Language Navigation (VLN) is a challenging task that requires a robot to navigate photo-realistic environments following natural language instructions from humans. Recent studies handle this task by constructing a semantic spatial map representation of the environment and then leveraging the strong reasoning ability of large language models to generate code that guides the robot's navigation. However, these methods face limitations in instance-level and attribute-level navigation tasks because they cannot distinguish different instances of the same object. To address this challenge, we propose a new method, the Instance-aware Visual Language Map (IVLMap), to empower the robot with instance-level and attribute-level semantic mapping. The map is autonomously constructed by fusing the RGBD video data collected by the robot agent with a specially designed natural language map indexing in the bird's-eye view; this indexing is both instance-level and attribute-level. In particular, when integrated with a large language model, IVLMap demonstrates the capability to i) transform natural language into navigation targets with instance and attribute information, enabling precise localization, and ii) accomplish zero-shot end-to-end navigation tasks based on natural language commands. Extensive navigation experiments are conducted. Simulation results illustrate that our method achieves an average improvement of 14.4% in navigation accuracy. Code and demo are released at https://ivlmap.github.io/.

I. INTRODUCTION

DEVELOPING a robot that can cooperate with humans is of great importance in many real-world applications, such as Tesla Optimus and Mobile ALOHA [1]. A robot agent that can understand human language and navigate intelligently therefore stands to significantly benefit human society. Towards this end, Vision-and-Language Navigation (VLN) has been developed over the past few years; its goal is to empower a robot agent to navigate in photo-realistic environments according to natural language instructions, such as "navigate to the 3rd chair" or "navigate to the yellow sofa" [2], [3]. This requires that the robot agent interpret natural language from humans, perceive its visual environment, and use this information to navigate, where a key issue is how to structure the visited environment and perform global planning. To handle this, a few recent approaches [4] use a topological map to structure the environment, but such maps have difficulty representing the spatial relations among objects, so detailed information may be lost. More recent works [5] model the navigation environment using a top-down semantic map, which represents spatial relations more precisely; however, the available semantic concepts are extremely limited because the semantic labels are pre-defined.

In general, based on the navigation environment, VLN can be roughly divided into two categories: navigation in discrete environments and navigation in continuous environments. In discrete environments, such as those in R2R [2], REVERIE [6], and SOON [7], VLN is conceptualized as a topological structure of interconnected navigable nodes. The agent uses a connectivity graph to move between adjacent nodes by selecting one of the available navigable directions. In contrast, VLN in continuous environments, such as R2R-CE [8] and RxR-CE [9], gives the agent the flexibility to navigate to any unobstructed point using a set of low-level actions (e.g., move forward 0.25 m, turn left 15 degrees) instead of teleporting between fixed nodes. This setting is closer to real-world robot navigation and poses a more intricate challenge for the agent.
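For concreteness, a low-level action interface of the kind used in continuous VLN might look as follows. This is a minimal illustrative sketch: the action names, step size, and turn angle are assumptions based on the values quoted above, not the API of any particular simulator.

```python
from dataclasses import dataclass
import math

@dataclass
class Pose:
    x: float        # metres
    y: float        # metres
    heading: float  # radians, 0 = +x axis

# Hypothetical low-level action set for a continuous-environment agent.
STEP_SIZE_M = 0.25                  # "move forward 0.25 m"
TURN_ANGLE_RAD = math.radians(15)   # "turn left/right 15 degrees"

def move_forward(pose: Pose) -> Pose:
    """Advance the agent one step along its current heading."""
    return Pose(
        x=pose.x + STEP_SIZE_M * math.cos(pose.heading),
        y=pose.y + STEP_SIZE_M * math.sin(pose.heading),
        heading=pose.heading,
    )

def turn_left(pose: Pose) -> Pose:
    """Rotate the agent counter-clockwise by the fixed turn angle."""
    return Pose(pose.x, pose.y, pose.heading + TURN_ANGLE_RAD)

def turn_right(pose: Pose) -> Pose:
    """Rotate the agent clockwise by the fixed turn angle."""
    return Pose(pose.x, pose.y, pose.heading - TURN_ANGLE_RAD)
```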

Take the instance illustrated in Fig. 1, where the language instruction is to "navigate to the fourth black chair across from the table". To execute this command, we first explore the entire room to identify all chair instances and extract their color attributes, and then locate the fourth chair with the black color specified in the command. Completing such a navigation task becomes easy if we have a global map of the scene that encodes information about each object, including its category, instance details, and color. Recently, VLMap [5] pioneered zero-shot spatial navigation by constructing semantic maps that index visual landmarks with natural language. However, VLMap can only navigate to the vicinity of the closest object of a given category, and cannot fulfill the more common and more precise instance-level and attribute-level navigation needs of real-life scenarios.
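As a rough illustration of the kind of query such a map must support (a sketch with invented field names, not the paper's actual interface), suppose the instruction has already been reduced to a structured target — in IVLMap this reduction is delegated to a large language model. Locating the goal then becomes an ordinal lookup over stored instances:

```python
# Toy instance table, in the spirit of an instance- and attribute-level map.
# The field names ("category", "color", "center") are illustrative assumptions.
instances = [
    {"category": "chair", "color": "black", "center": (1.2, 0.4)},
    {"category": "chair", "color": "black", "center": (1.9, 0.4)},
    {"category": "chair", "color": "white", "center": (2.6, 0.4)},
    {"category": "chair", "color": "black", "center": (3.3, 0.4)},
    {"category": "chair", "color": "black", "center": (4.0, 0.4)},
]

def lookup_instance(instances, category, color=None, ordinal=1):
    """Return the ordinal-th instance of `category` with the given color,
    counted in the order the instances are stored in the map."""
    matches = [
        inst for inst in instances
        if inst["category"] == category
        and (color is None or inst["color"] == color)
    ]
    return matches[ordinal - 1] if ordinal <= len(matches) else None

# "navigate to the fourth black chair" -> category=chair, color=black, ordinal=4
goal = lookup_instance(instances, "chair", color="black", ordinal=4)
print(goal["center"] if goal else "no matching instance")  # (4.0, 0.4)
```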

In this paper, to address this challenge, we propose a new method, the Instance-aware Visual Language Map (IVLMap), which empowers the robot with instance-level and attribute-level semantic mapping. IVLMap is autonomously constructed by fusing RGBD video data with a specially designed natural language map indexing in the bird's-eye view; the indexing is both instance-level and attribute-level. In this way, IVLMap can cleanly separate different instances within the same category. When integrated with a large language model, it demonstrates the capability to i) transform natural language into navigation targets with instance and attribute information, enabling precise localization, and ii) accomplish zero-shot end-to-end navigation tasks based on natural language commands. The main contributions of the proposed work are as follows:

  1. A novel instance-level and attribute-level map in the bird's-eye view is developed for better comprehension of the environment and for robot navigation planning. The instances in the map are obtained through region matching and label scoring, which assign a category label to each SAM [10]-segmented mask; the color attribute label for each mask is obtained with a similar approach.

  2. Leveraging IVLMap, we propose a two-step method for locating landmarks: an initial coarse-grained localization of the corresponding mask, followed by fine-grained localization and navigation within the mask's designated region (see the sketch after this list).

  3. We establish an interactive data collection platform that enables real-time, controllable data acquisition, which both reduces data volume and improves reconstruction efficiency. Furthermore, we conduct experiments in a real-world environment, providing valuable insights for the practical deployment of this method.
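To make the coarse-to-fine idea in contribution 2 concrete, here is a minimal sketch assuming a bird's-eye grid in which each cell stores an instance id and a side table records each instance's category and color. The grid layout, field names, and selection rules are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

# Hypothetical bird's-eye map: each cell holds an instance id (0 = free space),
# and a side table maps instance ids to their category and color attribute.
H, W = 200, 200
instance_grid = np.zeros((H, W), dtype=np.int32)
instance_grid[40:60, 20:40] = 7     # first black chair occupies these cells
instance_grid[40:60, 60:80] = 8     # white chair
instance_grid[40:60, 100:120] = 9   # second black chair

instance_info = {
    7: {"category": "chair", "color": "black"},
    8: {"category": "chair", "color": "white"},
    9: {"category": "chair", "color": "black"},
}

def coarse_localize(category, color, ordinal):
    """Step 1 (coarse): pick the ordinal-th instance matching category and color."""
    matches = [
        iid for iid, info in sorted(instance_info.items())
        if info["category"] == category and (color is None or info["color"] == color)
    ]
    return matches[ordinal - 1] if ordinal <= len(matches) else None

def fine_localize(instance_id):
    """Step 2 (fine): pick a goal cell inside the chosen instance's mask region,
    here simply the centroid of its cells in the grid."""
    ys, xs = np.nonzero(instance_grid == instance_id)
    if len(ys) == 0:
        return None
    return int(ys.mean()), int(xs.mean())

# "the second black chair" -> coarse pick of an instance, then a point inside it.
iid = coarse_localize("chair", "black", ordinal=2)
goal_cell = fine_localize(iid) if iid is not None else None
print(iid, goal_cell)  # 9 (49, 109)
```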

The rest of this work is organized as follows: Section II reviews related work; Section III describes the proposed IVLMap in detail; extensive simulations are presented in Section IV; and conclusions are drawn in Section V.

II. RELATED WORK

Semantic Mapping. Semantic mapping has advanced significantly, driven by the integration of Convolutional Neural Networks (CNNs) [11] with dense Simultaneous Localization and Mapping (SLAM) [12]. SLAM++ [13] introduced an object-oriented SLAM approach that leverages pre-existing knowledge. Subsequent studies, e.g., [14], use Mask R-CNN for instance-level semantics in 3D volumetric maps. VLMaps [5] and NLMaps-Saycan [15] introduce scene representations queryable through natural language, using Visual Language Models (VLMs). While prior studies focus on semantic detail, our work emphasizes pixel-level segmentation accuracy, categorizing each pixel precisely.

Instance Segmentation. Instance-level segmentation is crucial in VLN, which requires precise identification and localization of individual instances of similar objects. Innovative research [16] addresses human instance segmentation without a distinct detection stage. Real-time solutions [17] introduce a Spatial Attention-Guided Mask (SAG-Mask) branch. Meta AI's Segment Anything Model (SAM) [10] excels at promptable segmentation, enabling zero-shot generalization to unfamiliar objects. SAM's mask segmentation, while powerful, does not label the individual masks it produces, which limits its direct use in practical applications.
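As an illustration of that last point, SAM's automatic mask generator returns geometric masks with quality scores but no semantic labels. The sketch below uses the publicly released segment-anything package; the checkpoint and image paths are placeholders:

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Placeholder paths; download the official SAM weights and supply your own image.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("room.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each entry describes a region's geometry and quality, but carries no
# category label -- which is exactly the gap IVLMap fills with its
# region matching and label scoring.
for m in masks[:3]:
    print(m["bbox"], m["area"], m["predicted_iou"])
```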

Vision-and-Language Navigation (VLN). Recent years have seen significant strides in VLN, driven by researchers such as Y. Long et al., who employed large-scale training to imbue models like VIM-Net with extensive domain knowledge [18]. Addressing fusion challenges, VIM-Net matches visual and linguistic information for accurate navigation [19]. Innovations like Talk2Nav, with dual attention mechanisms and spatial memory, tackle long-range VLN [20]. Challenges persist, including navigating to unseen objects, interpreting language descriptions precisely, personalizing navigation to object attributes, and managing the data-intensive requirements of end-to-end navigation [5], [21], [22].

LLMs in VLN. Large Language Models (LLMs) play a pivotal role in VLN, as evident in NavGPT's reasoning capabilities [23]. LM-Nav uses GPT-3 to parse instructions and navigate between landmarks [4]. VELMA extends LLMs to street-level navigation, enriching AI in complex urban environments [24]. Studies such as Vemprala et al.'s integration of ChatGPT into robotics applications and PaLM-E's multimodal reasoning over visual and textual inputs [25], [26] influence our project, in which we achieve accurate navigation using LLMs guided by natural language.
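In the same spirit as these systems, a language model can be prompted to turn a free-form command into calls against a small navigation API. The prompt wording, the goto function, and the generic chat-completion hook below are illustrative assumptions, not the interface used in this paper:

```python
# Illustrative only: prompt an LLM to emit a structured navigation call,
# which is then validated and dispatched against an instance-aware map.
SYSTEM_PROMPT = """You control a robot with one function:
  goto(category: str, color: str | None, ordinal: int)
Given a user instruction, reply with exactly one goto(...) call."""

def llm_to_call(instruction: str, chat_completion) -> str:
    """`chat_completion` is any callable wrapping an LLM chat endpoint
    (e.g., a thin wrapper around the OpenAI client) that returns the
    model's text reply."""
    return chat_completion(
        [{"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": instruction}]
    ).strip()

# Expected style of output for "navigate to the fourth black chair":
#   goto("chair", "black", 4)
# The reply is then parsed and handed to the planner that queries the map
# and drives the low-level controller.
```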

:::info
This paper is available on arXiv under a CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
