Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks
Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang
CVPR 2020

Abstract


Vision-Language Navigation (VLN) is a task where an agent learns to navigate by following a natural language instruction. The key to this task is to perceive both the visual scene and the natural language instruction sequentially. Conventional approaches fully exploit vision and language features in cross-modal grounding. However, the VLN task remains challenging, since previous works have implicitly neglected the rich semantic information contained in environments (such as navigation graphs or sub-trajectory semantics). In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks that exploit the additional training signals derived from this semantic information. The auxiliary tasks have four reasoning objectives: explaining the previous actions, evaluating the trajectory consistency, estimating the progress, and predicting the next direction. These additional training signals help the agent acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of environments. Our experiments demonstrate that the auxiliary reasoning tasks improve both performance on the main task and the generalizability of the model by a large margin. We further demonstrate empirically that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms the previous state-of-the-art method, making it the best existing approach on the standard benchmark.

Framework

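The abstract describes four self-supervised auxiliary reasoning objectives trained jointly with the main navigation objective. The sketch below is a minimal, hypothetical illustration of how such a multi-task loss could be assembled; it is not the authors' implementation, and all module names, head architectures, and loss weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AuxRNLoss(nn.Module):
    """Hypothetical sketch of a multi-task objective in the spirit of AuxRN:
    the main navigation loss plus four self-supervised auxiliary losses
    (action retelling, trajectory-instruction matching, progress estimation,
    and next-direction prediction). Heads and weights are assumptions."""

    def __init__(self, hidden_dim=512, num_directions=4,
                 weights=(1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        # Assumed auxiliary prediction heads over the agent's hidden state.
        self.progress_head = nn.Linear(hidden_dim, 1)             # progress estimation
        self.match_head = nn.Linear(hidden_dim, 1)                # trajectory consistency
        self.angle_head = nn.Linear(hidden_dim, num_directions)   # next-direction prediction
        self.w = weights
        self.bce = nn.BCEWithLogitsLoss()
        self.ce = nn.CrossEntropyLoss()
        self.mse = nn.MSELoss()

    def forward(self, h, nav_loss, retell_loss,
                progress_target, match_target, angle_target):
        # h: agent hidden state, shape [batch, hidden_dim].
        # Progress estimation: regress normalized progress in [0, 1].
        l_progress = self.mse(
            torch.sigmoid(self.progress_head(h)).squeeze(-1), progress_target)
        # Trajectory consistency: binary match between instruction and path.
        l_match = self.bce(self.match_head(h).squeeze(-1), match_target)
        # Next-direction prediction: classify the upcoming heading.
        l_angle = self.ce(self.angle_head(h), angle_target)
        # retell_loss (explaining previous actions) is assumed to come from a
        # separate speaker-style decoder and is passed in precomputed.
        return (nav_loss
                + self.w[0] * retell_loss
                + self.w[1] * l_match
                + self.w[2] * l_progress
                + self.w[3] * l_angle)
```

In such a setup, `nav_loss` would be the standard imitation or reinforcement navigation objective and `retell_loss` the reconstruction loss of a decoder that re-generates the instruction from the traversed sub-trajectory; both are computed outside this sketch.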

Experiment


Conclusion


In this paper, we presented a novel framework, Auxiliary Reasoning Navigation (AuxRN), that facilitates navigation learning with four auxiliary reasoning tasks. Our experiments confirm that AuxRN improves performance on the VLN task both quantitatively and qualitatively. In future work, we plan to build a general framework for auxiliary reasoning tasks that exploits common-sense information.


Acknowledgement


This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. U19A2073 and No. 61976233, in part by the Air Force Research Laboratory and DARPA under agreement number FA8750-19-2-0501, and in part by an Australian Research Council Discovery Early Career Researcher Award (DE190100626).