Cross-modal Map Learning for Vision and Language Navigation

CVPR 2022

Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan,
Eleni Miltsakaki, Dan Roth, Kostas Daniilidis
University of Pennsylvania

Abstract

We consider the problem of Vision-and-Language Navigation (VLN). The majority of current methods for VLN are trained end-to-end, either using unstructured memory such as an LSTM, or applying cross-modal attention over the egocentric observations of the agent. In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations. In this work, we propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints. In both cases, the prediction is informed by the language through cross-modal attention mechanisms. We experimentally test the basic hypothesis that language-driven navigation can be solved given a map, and then show competitive results on the full VLN-CE benchmark.


Overview

We are motivated by studies of navigation in biological systems which suggest that humans build cognitive maps during such tasks. In contrast to other works, which attempt to ground natural language on egocentric RGB-D observations, we argue that an egocentric map offers a more natural representation for this task. To this end, we propose a novel navigation system for the VLN task in continuous environments that learns a language-informed representation for both map and trajectory prediction. Our method semantically grounds the language through an egocentric map prediction task that learns to hallucinate information outside the agent's field of view. This is followed by spatial grounding of the instruction through path prediction on the egocentric map.
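For illustration, below is a minimal PyTorch-style sketch of this two-stage idea: a language-conditioned map completion step followed by waypoint prediction on the completed map. The module names, channel counts, and single-layer attention layout are simplified assumptions for exposition, not the actual model described in the paper.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    # Attend from map-cell features (queries) to instruction token embeddings (keys/values).
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, map_seq, instr_tokens):
        # map_seq: B x (H*W) x D, instr_tokens: B x L x D
        attended, _ = self.attn(map_seq, instr_tokens, instr_tokens)
        return self.norm(map_seq + attended)

class MapAndPathSketch(nn.Module):
    # map_channels, dim, and num_waypoints are illustrative values, not the paper's.
    def __init__(self, map_channels=27, dim=128, num_waypoints=10):
        super().__init__()
        self.encode = nn.Conv2d(map_channels, dim, kernel_size=3, padding=1)
        self.map_attn = CrossModalBlock(dim)    # language-informed map completion
        self.path_attn = CrossModalBlock(dim)   # language-informed path prediction
        self.map_head = nn.Conv2d(dim, map_channels, kernel_size=1)    # semantic logits per cell
        self.path_head = nn.Conv2d(dim, num_waypoints, kernel_size=1)  # one heatmap per waypoint

    def forward(self, partial_map, instr_tokens):
        # partial_map:  B x C x H x W egocentric semantic map built from projected RGB-D
        # instr_tokens: B x L x dim  instruction token embeddings (e.g. from a text encoder)
        B, _, H, W = partial_map.shape
        feats = self.encode(partial_map)                    # B x dim x H x W
        seq = feats.flatten(2).transpose(1, 2)              # B x (H*W) x dim
        seq = self.map_attn(seq, instr_tokens)
        full_map = self.map_head(seq.transpose(1, 2).reshape(B, -1, H, W))
        seq = self.path_attn(seq, instr_tokens)
        waypoint_heatmaps = self.path_head(seq.transpose(1, 2).reshape(B, -1, H, W))
        return full_map, waypoint_heatmaps                  # B x C x H x W, B x K x H x W

# Example usage with dummy inputs:
# model = MapAndPathSketch()
# full_map, heatmaps = model(torch.zeros(1, 27, 64, 64), torch.zeros(1, 20, 128))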


Architecture

At the core of our method are two cross-modal attention modules that learn language-informed representations to facilitate both the hallucination of semantics over unobserved areas and the prediction of the set of waypoints that the agent needs to follow to reach the goal. In the architecture figure, the components colored in blue correspond to the map prediction part of our model, the ones in orange to the path prediction, and the yellow boxes denote the losses.
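To make the role of the loss boxes concrete, below is a hedged sketch of a plausible training objective: per-cell cross-entropy on the predicted semantic map and a heatmap-matching term on the predicted waypoints. The exact loss formulation and weighting in the paper may differ; the function names and the balancing coefficient are assumptions.

import torch
import torch.nn.functional as F

def semantic_map_loss(pred_map_logits, gt_map_labels):
    # pred_map_logits: B x C x H x W class logits; gt_map_labels: B x H x W class index per cell
    return F.cross_entropy(pred_map_logits, gt_map_labels)

def waypoint_loss(pred_heatmaps, gt_heatmaps):
    # pred_heatmaps, gt_heatmaps: B x K x H x W, one spatial heatmap per waypoint
    return F.mse_loss(torch.sigmoid(pred_heatmaps), gt_heatmaps)

def total_loss(pred_map_logits, gt_map_labels, pred_heatmaps, gt_heatmaps, path_weight=1.0):
    # path_weight is an assumed balancing coefficient, not a value from the paper
    return semantic_map_loss(pred_map_logits, gt_map_labels) + path_weight * waypoint_loss(pred_heatmaps, gt_heatmaps)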

Learned Representations

Left: Visualization of the cross-modal path attention decoder output, which focuses on areas around goal locations and along paths. The agent's location is denoted with a green circle and the goal with an orange star. Right: Visualization of the cross-modal attention representation between the map and specific word tokens. The representation tends to focus on semantic areas of the map that correspond to the object referred to by the token.
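As a hedged sketch of how such a per-token visualization can be produced, the helper below reads the attention weights of a cross-modal layer and reshapes the column belonging to one word token into an H x W heatmap over the map. It assumes a batch-first nn.MultiheadAttention layer and illustrative variable names; it is not the exact visualization code used for the figure.

import torch
import torch.nn as nn

def token_attention_heatmap(attn: nn.MultiheadAttention, map_seq, instr_tokens, token_idx, H, W):
    # attn is assumed to be constructed with batch_first=True.
    # map_seq: B x (H*W) x D map-cell queries; instr_tokens: B x L x D token keys/values.
    _, weights = attn(map_seq, instr_tokens, instr_tokens, need_weights=True)
    # weights: B x (H*W) x L -- how strongly each map cell attends to each word token
    return weights[:, :, token_idx].reshape(-1, H, W)  # B x H x W heatmap for the chosen token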

Results

Navigation examples. Top: RGB observations of the agent, predicted map and path (red dots), ground-truth map and path (blue dots). Bottom: depth observation, instruction, and the color legend for the semantic labels in the map. The maps are egocentric (the agent is in the middle looking upwards). Note that the goal is neither visible nor within the egocentric map at the beginning of each episode.
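Since the maps are egocentric, a predicted waypoint cell can be turned into a motion target relative to the agent. The small helper below makes that convention explicit; the map size and cell resolution are assumed values for illustration.

def cell_to_egocentric_offset(row, col, map_size=192, cell_size_m=0.05):
    # The agent sits at the center cell and looks "up" (towards smaller row indices),
    # so moving up in the map means moving forward. map_size and cell_size_m are
    # assumed values, not the exact configuration used in the paper.
    center = map_size // 2
    forward_m = (center - row) * cell_size_m  # metres ahead of the agent
    right_m = (col - center) * cell_size_m    # metres to the agent's right
    return forward_m, right_m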

Semantic map predictions with and without cross-modal map attention. The cross-modal map attention extracts useful information from language that improves the prediction of the semantic map.

Acknowledgements

Research was sponsored by the Army Research Office and was accomplished under Grant Number W911NF-20-1-0080, as well as by the ARL DCIST CRA W911NF-17-2-0181, NSF TRIPODS 1934960, and NSF CPS 2038873 grants.
