Please use this identifier to cite or link to this item:
https://hdl.handle.net/2440/120112
Type: | Conference paper |
Title: | Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments |
Author: | Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sünderhauf, N.; Reid, I.D.; Gould, S.; van den Hengel, A. |
Citation: | Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3674-3683 |
Publisher: | IEEE |
Issue Date: | 2018 |
Series/Report no.: | IEEE Conference on Computer Vision and Pattern Recognition |
ISBN: | 9781538664209 |
ISSN: | 2575-7075 |
Conference Name: | IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (18 Jun 2018 - 23 Jun 2018 : Salt Lake City, UT) |
Statement of Responsibility: | Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, Anton van den Hengel |
Abstract: | A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions, we present the Matterport3D Simulator - a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings - the Room-to-Room (R2R) dataset. |
Rights: | © 2018 IEEE |
DOI: | 10.1109/CVPR.2018.00387 |
Grant ID: | http://purl.org/au-research/grants/arc/CE140100016 http://purl.org/au-research/grants/arc/DP160102156 |
Published version: | http://dx.doi.org/10.1109/cvpr.2018.00387 |
Appears in Collections: | Aurora harvest 8; Computer Science publications |
Files in This Item:
There are no files associated with this item.
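As an illustrative aside to the abstract above, which frames instruction following as a visually grounded sequence-to-sequence translation problem, the sketch below shows one minimal way such an agent can be wired up: an LSTM encodes the instruction, and a decoder fuses a per-step visual feature with an attended language context to predict the next navigation action. This is not the authors' implementation; the class name, feature dimensions, action-space size, and dot-product attention scheme are all assumptions made for exposition.

```python
# A minimal, illustrative sketch of a visually grounded seq2seq agent
# (assumed names and sizes; not the paper's actual model).
import torch
import torch.nn as nn

class Seq2SeqNavAgent(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 visual_dim=2048, num_actions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Decoder input: current visual feature + attended instruction context.
        self.decoder = nn.LSTMCell(visual_dim + hidden_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, hidden_dim)  # projection for dot-product attention
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, instruction, visual_feats):
        # instruction: (B, T_lang) token ids; visual_feats: (B, T_nav, visual_dim)
        enc_out, (h, c) = self.encoder(self.embed(instruction))
        h, c = h.squeeze(0), c.squeeze(0)
        logits = []
        for t in range(visual_feats.size(1)):
            # Attend over the encoded instruction with the current decoder state.
            scores = torch.bmm(enc_out, self.attn(h).unsqueeze(2)).squeeze(2)
            context = torch.bmm(scores.softmax(dim=1).unsqueeze(1), enc_out).squeeze(1)
            h, c = self.decoder(torch.cat([visual_feats[:, t], context], dim=1), (h, c))
            logits.append(self.action_head(h))
        return torch.stack(logits, dim=1)  # (B, T_nav, num_actions)

# Example: a batch of 2 trajectories, 8-token instructions, 5 navigation steps.
agent = Seq2SeqNavAgent(vocab_size=1000)
actions = agent(torch.randint(0, 1000, (2, 8)), torch.randn(2, 5, 2048))
print(actions.shape)  # torch.Size([2, 5, 6])
```

In an actual R2R setup, the per-step visual features would come from the Matterport3D Simulator's rendered viewpoints rather than random tensors, and the action logits would be trained against demonstration trajectories or with reinforcement learning.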