Abstract
Vision-Language Navigation (VLN) requires an agent to follow language instructions to reach a target position. A key factor for successful navigation is aligning the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment, especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm, called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem by introducing a novel correctable landmark discovery scheme based on two large models, ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark cooccurrence commonsense and conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in the priors caused by the lack of visual constraints, we introduce a learnable cooccurrence scoring module, which corrects the importance of each cooccurrence according to actual observations for accurate landmark discovery. We further design an observation enhancement strategy that allows our framework to be combined with different VLN agents: the corrected landmark features are used to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show that CONSOLE significantly outperforms strong baselines. In particular, CONSOLE establishes new state-of-the-art results on R2R and R4R in unseen scenarios. Code is available at https://github.com/expectorlin/CONSOLE.
Framework
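The correctable landmark discovery idea summarized in the abstract can be illustrated with a small, hedged sketch. This is not the authors' implementation: the function names (`corrected_landmark_feature`, `discover_landmark`), the toy 3-dimensional vectors standing in for CLIP embeddings, and the `temperature` parameter are all assumptions made for illustration. In particular, the paper's cooccurrence scoring module is learnable, whereas the sketch below substitutes a fixed cosine-similarity softmax to show only the reweighting step.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # L2-normalize a vector; return a copy unchanged if its norm is zero.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def corrected_landmark_feature(cooccurrence_feats, observation_feat, temperature=1.0):
    """Reweight cooccurrence embeddings by similarity to the current observation.

    Assumption: a non-learned stand-in for the learnable scoring module.
    """
    obs = normalize(observation_feat)
    feats = [normalize(f) for f in cooccurrence_feats]
    scores = [dot(f, obs) / temperature for f in feats]
    weights = softmax(scores)
    fused = [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(len(obs))]
    return fused, weights

def discover_landmark(landmark_feat, candidate_view_feats):
    """Return the index of the candidate view most similar to the landmark feature."""
    lm = normalize(landmark_feat)
    sims = [dot(lm, normalize(v)) for v in candidate_view_feats]
    return max(range(len(sims)), key=sims.__getitem__), sims

# Toy data (hypothetical): two cooccurrence embeddings for one landmark,
# one observation embedding, and two candidate view embeddings.
cooccurrences = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
observation = [0.9, 0.1, 0.0]
views = [[0.0, 0.0, 1.0], [1.0, 0.2, 0.0]]

fused, weights = corrected_landmark_feature(cooccurrences, observation)
best_view, sims = discover_landmark(fused, views)
```

In this toy setup, the cooccurrence that agrees with the observation receives the larger weight, so the fused landmark feature leans toward it before candidate views are compared. A learned scoring module would replace the fixed similarity with trainable parameters.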
Experiment
Conclusion
In this paper, we propose CONSOLE for VLN, which casts VLN as an open-world sequential landmark discovery problem by introducing a correctable landmark discovery framework based on two powerful large models. We harvest rich landmark cooccurrence commonsense from ChatGPT and employ CLIP to conduct landmark discovery. A learnable cooccurrence scoring module is constructed to correct the priors provided by ChatGPT according to actual observations. Experimental results show that CONSOLE consistently outperforms strong baselines on R2R, REVERIE, R4R, and RxR, and establishes new state-of-the-art results on R2R and R4R in unseen scenarios. We believe that our work can serve as a meaningful reference for researchers in both VLN and embodied AI on how to effectively harvest and utilize the knowledge inside large models to assist robotic tasks, which is a promising way to improve the performance of robots. In future work, we would like to explore lower-cost large models for assisting embodied AI tasks, which would be more practical in real-world applications. Developing efficient in-domain pretraining paradigms to adapt large models to embodied AI tasks is also worth studying.
Acknowledgement
This work was supported in part by the National Science and Technology Major Project (2020AAA0109704), Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), Mobility Grant Award under Grant No. M-0461, Shenzhen Science and Technology Program (Grant No. GJHZ20220913142600001), Nansha Key R&D Program under Grant No. 2022ZD014, and CAAI-Huawei MindSpore Open Fund. We thank MindSpore, a new deep learning computing framework, for the partial support of this work.