Vision and Language Learning: From Image Captioning and Visual Question Answering towards Embodied Agents

dc.contributor.author: Anderson, Peter James
dc.date.accessioned: 2019-06-12T00:05:16Z
dc.date.available: 2019-06-12T00:05:16Z
dc.date.issued: 2018
dc.description.abstract: Each time we ask for an object, describe a scene, follow directions or read a document containing images or figures, we are converting information between visual and linguistic representations. Indeed, for many tasks it is essential to reason jointly over visual and linguistic information. People do this with ease, typically without even noticing. Intelligent systems that perform useful tasks in unstructured situations, and interact with people, will also require this ability. In this thesis, we focus on the joint modelling of visual and linguistic information using deep neural networks. We begin by considering the challenging problem of automatically describing the content of an image in natural language, i.e., image captioning. Although there is considerable interest in this task, progress is hindered by the difficulty of evaluating the generated captions. Our first contribution is a new automatic image caption evaluation metric that measures the quality of generated captions by analysing their semantic content. Extensive evaluations across a range of models and datasets indicate that our metric, dubbed SPICE, shows high correlation with human judgements. Armed with a more effective evaluation metric, we address the challenge of image captioning. Visual attention mechanisms have been widely adopted in image captioning and visual question answering (VQA) architectures to facilitate fine-grained visual processing. We extend existing approaches by proposing a bottom-up and top-down attention mechanism that enables attention to be focused at the level of objects and other salient image regions, which are the natural basis for attention to be considered. Applying this approach to image captioning, we achieve state-of-the-art results on the COCO test server. Demonstrating the broad applicability of the method, we apply the same approach to VQA and obtain first place in the 2017 VQA Challenge. Despite these advances, recurrent neural network (RNN) image captioning models typically do not generalise well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real applications. To address this problem, we propose constrained beam search, an approximate search algorithm that enforces constraints over RNN output sequences. Using this approach, we show that existing RNN captioning architectures can take advantage of side information such as object detector outputs and ground-truth image annotations at test time, without retraining. Our results significantly outperform previous approaches that incorporate the same information into the learning algorithm, achieving state-of-the-art results for out-of-domain captioning on COCO. Finally, to enable and encourage the application of vision and language methods to problems involving embodied agents, we present the Matterport3D Simulator, a large-scale interactive reinforcement learning environment constructed from densely-sampled panoramic RGB-D images of 90 real buildings. Using this simulator, which can in future support a range of embodied vision and language tasks, we collect the first benchmark dataset for visually-grounded natural language navigation in real buildings. We investigate the difficulty of this task, and particularly the difficulty of operating in unseen environments, using several baselines and a sequence-to-sequence model based on methods successfully applied to other vision and language tasks.
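The abstract describes SPICE as a metric that scores a caption by analysing its semantic content against reference captions. As a rough illustration only, the sketch below shows the tuple-level F-score at the heart of a SPICE-style comparison, assuming the captions have already been reduced to semantic tuples of objects, attributes and relations; the parsing of captions into scene graphs, synonym-aware matching, and the function name tuple_f_score are illustrative assumptions, not the thesis implementation.

```python
# Illustrative sketch only: a SPICE-style score reduces candidate and reference
# captions to sets of semantic tuples and compares them with an F-score.
# The real metric parses captions into scene graphs and matches tuples with
# synonym sets; here the tuples are assumed given and matched exactly.

def tuple_f_score(candidate_tuples, reference_tuples):
    """F1 over semantic tuples from a candidate caption and its references."""
    cand = set(candidate_tuples)
    ref = set(reference_tuples)
    if not cand or not ref:
        return 0.0
    matched = cand & ref
    precision = len(matched) / len(cand)
    recall = len(matched) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical tuples for a caption like "a young girl standing on a tennis court".
candidate = [("girl",), ("girl", "young"), ("girl", "standing"),
             ("court",), ("court", "tennis"), ("girl", "on-top-of", "court")]
reference = [("girl",), ("girl", "young"), ("racket",),
             ("court",), ("court", "tennis"), ("girl", "on-top-of", "court")]

print(tuple_f_score(candidate, reference))  # 5 of 6 tuples match: F1 = 0.833...
```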
dc.identifier.other: b59287020
dc.identifier.uri: http://hdl.handle.net/1885/164018
dc.language.iso: en_AU
dc.subject: image caption generation
dc.subject: image captioning
dc.subject: automatic image description
dc.subject: visual question answering
dc.subject: VQA
dc.subject: COCO
dc.subject: COCO dataset
dc.subject: vision and language
dc.subject: language and vision
dc.subject: vision and language navigation
dc.subject: VLN
dc.subject: SPICE
dc.subject: SPICE metric
dc.subject: image caption evaluation
dc.subject: image caption evaluation metric
dc.subject: bottom-up and top-down attention
dc.subject: visual attention
dc.subject: image attention
dc.subject: Matterport
dc.subject: Matterport3D
dc.subject: Matterport3D Simulator
dc.subject: constrained beam search
dc.subject: embodied agents
dc.subject: vision and language agents
dc.title: Vision and Language Learning: From Image Captioning and Visual Question Answering towards Embodied Agents
dc.type: Thesis (PhD)
dcterms.valid: 2019
local.contributor.affiliation: College of Engineering and Computer Science, The Australian National University
local.contributor.authoremail: peteanderson80@gmail.com
local.contributor.supervisor: Gould, Stephen
local.contributor.supervisorcontact: stephen.gould@anu.edu.au
local.description.notes: The author deposited 12/06/2019
local.identifier.doi: 10.25911/5d00d4ec451cc
local.mintdoi: mint
local.type.degree: Doctor of Philosophy (PhD)

Downloads

Original bundle
Name: Anderson Thesis 2019.pdf
Size: 28.7 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 884 B
Format: Item-specific license agreed upon at submission
 