Vision and Language Learning: From Image Captioning and Visual Question Answering towards Embodied Agents

dc.contributor.author: Anderson, Peter James
dc.date.accessioned: 2019-06-12T00:05:16Z
dc.date.available: 2019-06-12T00:05:16Z
dc.date.issued: 2018
dc.description.abstract: Each time we ask for an object, describe a scene, follow directions or read a document containing images or figures, we are converting information between visual and linguistic representations. Indeed, for many tasks it is essential to reason jointly over visual and linguistic information. People do this with ease, typically without even noticing. Intelligent systems that perform useful tasks in unstructured situations, and interact with people, will also require this ability. In this thesis, we focus on the joint modelling of visual and linguistic information using deep neural networks. We begin by considering the challenging problem of automatically describing the content of an image in natural language, i.e., image captioning. Although there is considerable interest in this task, progress is hindered by the difficulty of evaluating the generated captions. Our first contribution is a new automatic image caption evaluation metric that measures the quality of generated captions by analysing their semantic content. Extensive evaluations across a range of models and datasets indicate that our metric, dubbed SPICE, shows high correlation with human judgements. Armed with a more effective evaluation metric, we address the challenge of image captioning. Visual attention mechanisms have been widely adopted in image captioning and visual question answering (VQA) architectures to facilitate fine-grained visual processing. We extend existing approaches by proposing a bottom-up and top-down attention mechanism that enables attention to be focused at the level of objects and other salient image regions, which are the natural basis for attention to be considered. Applying this approach to image captioning, we achieve state-of-the-art results on the COCO test server. Demonstrating the broad applicability of the method, we apply the same approach to VQA and obtain first place in the 2017 VQA Challenge. Despite these advances, recurrent neural network (RNN) image captioning models typically do not generalise well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real applications. To address this problem, we propose constrained beam search, an approximate search algorithm that enforces constraints over RNN output sequences. Using this approach, we show that existing RNN captioning architectures can take advantage of side information such as object detector outputs and ground-truth image annotations at test time, without retraining. Our results significantly outperform previous approaches that incorporate the same information into the learning algorithm, achieving state-of-the-art results for out-of-domain captioning on COCO. Finally, to enable and encourage the application of vision and language methods to problems involving embodied agents, we present the Matterport3D Simulator, a large-scale interactive reinforcement learning environment constructed from densely-sampled panoramic RGB-D images of 90 real buildings. Using this simulator, which can in future support a range of embodied vision and language tasks, we collect the first benchmark dataset for visually-grounded natural language navigation in real buildings. We investigate the difficulty of this task, and particularly the difficulty of operating in unseen environments, using several baselines and a sequence-to-sequence model based on methods successfully applied to other vision and language tasks.
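The abstract describes SPICE as a metric that scores a caption by analysing its semantic content against reference captions. As a rough illustration only, the sketch below shows the tuple-level F-score at the heart of a SPICE-style comparison, assuming the captions have already been reduced to semantic tuples of objects, attributes and relations; the parsing of captions into scene graphs, synonym-aware matching, and the function name tuple_f_score are illustrative assumptions, not the thesis implementation.

```python
# Illustrative sketch only: a SPICE-style score reduces candidate and reference
# captions to sets of semantic tuples and compares them with an F-score.
# The real metric parses captions into scene graphs and matches tuples with
# synonym sets; here the tuples are assumed given and matched exactly.

def tuple_f_score(candidate_tuples, reference_tuples):
    """F1 over semantic tuples from a candidate caption and its references."""
    cand = set(candidate_tuples)
    ref = set(reference_tuples)
    if not cand or not ref:
        return 0.0
    matched = cand & ref
    precision = len(matched) / len(cand)
    recall = len(matched) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical tuples for a caption like "a young girl standing on a tennis court".
candidate = [("girl",), ("girl", "young"), ("girl", "standing"),
             ("court",), ("court", "tennis"), ("girl", "on-top-of", "court")]
reference = [("girl",), ("girl", "young"), ("racket",),
             ("court",), ("court", "tennis"), ("girl", "on-top-of", "court")]

print(tuple_f_score(candidate, reference))  # 5 of 6 tuples match: F1 = 0.833...
```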
dc.identifier.other: b59287020
dc.identifier.uri: http://hdl.handle.net/1885/164018
dc.language.iso: en_AU
dc.subject: image caption generation
dc.subject: image captioning
dc.subject: automatic image description
dc.subject: visual question answering
dc.subject: VQA
dc.subject: COCO
dc.subject: COCO dataset
dc.subject: vision and language
dc.subject: language and vision
dc.subject: vision and language navigation
dc.subject: VLN
dc.subject: SPICE
dc.subject: SPICE metric
dc.subject: image caption evaluation
dc.subject: image caption evaluation metric
dc.subject: bottom-up and top-down attention
dc.subject: visual attention
dc.subject: image attention
dc.subject: Matterport
dc.subject: Matterport3D
dc.subject: Matterport3D Simulator
dc.subject: constrained beam search
dc.subject: embodied agents
dc.subject: vision and language agents
dc.title: Vision and Language Learning: From Image Captioning and Visual Question Answering towards Embodied Agents
dc.type: Thesis (PhD)
dcterms.valid: 2019
local.contributor.affiliation: College of Engineering and Computer Science, The Australian National University
local.contributor.authoremail: peteanderson80@gmail.com
local.contributor.supervisor: Gould, Stephen
local.contributor.supervisorcontact: stephen.gould@anu.edu.au
local.description.notes: The author deposited 12/06/2019
local.identifier.doi: 10.25911/5d00d4ec451cc
local.mintdoi: mint
local.type.degree: Doctor of Philosophy (PhD)

Downloads

Original bundle
Name: Anderson Thesis 2019.pdf
Size: 28.7 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 884 B
Format: Item-specific license agreed upon at submission
 