Vision-Language Navigation (VLN) is a challenging task that requires an embodied agent to perform action-level modality alignment, i.e., to execute the actions requested by the instruction in sequence within complex visual environments. Most existing VLN agents learn directly from instruction-path data and cannot sufficiently explore the action-level alignment knowledge contained in the multi-modal inputs. In this paper, we propose modAlity-aligneD Action PrompTs (ADAPT), which provide the VLN agent with action prompts to enable explicit learning of action-level modality alignment for successful navigation. Specifically, an action prompt is defined as a modality-aligned pair of an image sub-prompt and a text sub-prompt, where the former is a single-view observation and the latter is a phrase like “walk past the chair”. When navigation starts, the instruction-related action prompt set is retrieved from a pre-built action prompt base and passed through a prompt encoder to obtain the prompt feature. The prompt feature is then concatenated with the original instruction feature and fed to a multi-layer transformer for action prediction. To collect high-quality action prompts into the prompt base, we use the Contrastive Language-Image Pretraining (CLIP) model, which has a powerful cross-modality alignment ability. A modality alignment loss and a sequential consistency loss are further introduced to strengthen the alignment of the action prompts and to encourage the agent to attend to the related prompts in sequence. Experimental results on both R2R and RxR show the superiority of ADAPT over state-of-the-art methods.
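The retrieve, encode, and concatenate steps described above can be sketched roughly as follows. This is a minimal illustration and not the actual ADAPT implementation: the phrase-matching retrieval rule, the averaging "prompt encoder", the concatenation along the sequence axis, and all names (`retrieve_prompts`, `encode_prompt`, `build_prompted_sequence`) are assumptions made for the example, and the feature vectors are toy values.

```python
from typing import Dict, List, Tuple

Vector = List[float]
# An action prompt pairs an image sub-prompt feature with a text sub-prompt feature.
ActionPrompt = Tuple[Vector, Vector]


def retrieve_prompts(instruction: str,
                     prompt_base: Dict[str, ActionPrompt]) -> List[ActionPrompt]:
    """Return prompts whose phrase occurs in the instruction, in order of appearance.

    Assumption: retrieval by literal phrase matching; the paper may use another rule.
    """
    text = instruction.lower()
    hits = [(text.find(phrase), prompt)
            for phrase, prompt in prompt_base.items()
            if phrase in text]
    hits.sort(key=lambda hit: hit[0])  # keep the order in which actions are mentioned
    return [prompt for _, prompt in hits]


def encode_prompt(prompt: ActionPrompt) -> Vector:
    """Toy stand-in for the prompt encoder: average the two sub-prompt features."""
    image_feat, text_feat = prompt
    return [(i + t) / 2.0 for i, t in zip(image_feat, text_feat)]


def build_prompted_sequence(instruction_feats: List[Vector],
                            prompts: List[ActionPrompt]) -> List[Vector]:
    """Append encoded prompt features to the instruction token features.

    The resulting sequence is what a transformer would consume for action prediction.
    """
    return instruction_feats + [encode_prompt(p) for p in prompts]


# Hypothetical prompt base with 2-dimensional toy features.
prompt_base = {
    "walk past the chair": ([1.0, 0.0], [0.0, 1.0]),
    "turn left": ([1.0, 3.0], [3.0, 1.0]),
    "go up the stairs": ([0.0, 2.0], [2.0, 0.0]),
}
instruction = "Turn left and walk past the chair."
prompts = retrieve_prompts(instruction, prompt_base)
sequence = build_prompted_sequence([[0.1, 0.2], [0.3, 0.4]], prompts)
print(len(prompts), len(sequence))
```

In this sketch only the two phrases that appear in the instruction are retrieved, and their encoded features are appended after the instruction token features, so the downstream model sees both in a single sequence.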
In this work, we propose modality-aligned action prompts (ADAPT), which prompt the VLN agent with explicit cross-modal action knowledge to enhance navigation performance. During navigation, the agent retrieves the action prompts from a pre-built action prompt base, and the resulting prompt-based instruction features are used to improve action decisions. The CLIP model is used to collect high-quality action prompts into the prompt base. We also propose a modality alignment loss and a sequential consistency loss for training. Experiments on public VLN benchmarks show the effectiveness of ADAPT, which establishes new state-of-the-art results. We hope this work can offer new directions for prompt-based navigation research. As for limitations, the action prompt base constructed in ADAPT inevitably contains some noise, owing to the limited capability of CLIP, the scene complexity, and the instruction diversity in the VLN task. Future work includes obtaining action prompts of better quality.
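To make the CLIP-based collection step and its noise concrete, the following minimal sketch pairs an action phrase with the candidate view whose embedding is most similar to it and discards the pair when the similarity is too low. It is only an illustration under stated assumptions: the embeddings are plain vectors assumed to come from a CLIP-like model (no real CLIP API is called), and the function names and the `min_similarity` threshold are hypothetical rather than taken from ADAPT.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def best_image_for_phrase(text_emb: Vector,
                          image_embs: Dict[str, Vector],
                          min_similarity: float) -> Optional[Tuple[str, float]]:
    """Pick the image sub-prompt that best matches a text sub-prompt.

    Returns (image_id, similarity), or None when even the best candidate falls
    below the (hypothetical) threshold and the pair is treated as too noisy.
    """
    scored = [(cosine(text_emb, emb), image_id)
              for image_id, emb in image_embs.items()]
    if not scored:
        return None
    score, image_id = max(scored)
    return (image_id, score) if score >= min_similarity else None


# Toy embeddings standing in for CLIP-like text and image features.
phrase_emb = [1.0, 0.0]
candidate_views = {"view_a": [0.9, 0.1], "view_b": [0.0, 1.0]}
match = best_image_for_phrase(phrase_emb, candidate_views, min_similarity=0.8)
print(match)
```

A stricter threshold keeps fewer but cleaner prompts, while a looser one keeps more prompts at the cost of more mismatched pairs, which is one simple way to think about the noise trade-off mentioned above.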