Vision and Language Learning: From Image Captioning and Visual Question Answering towards Embodied Agents
Date
2018
Authors
Anderson, Peter James
Abstract
Each time we ask for an object, describe a scene, follow
directions or read a document containing images or figures, we
are converting information between visual and linguistic
representations. Indeed, for many tasks it is essential to reason
jointly over visual and linguistic information. People do this
with ease, typically without even noticing. Intelligent systems
that perform useful tasks in unstructured situations, and
interact with people, will also require this ability.
In this thesis, we focus on the joint modelling of visual and
linguistic information using deep neural networks. We begin by
considering the challenging problem of automatically describing
the content of an image in natural language, i.e., image
captioning. Although there is considerable interest in this task,
progress is hindered by the difficulty of evaluating the
generated captions. Our first contribution is a new automatic
image caption evaluation metric that measures the quality of
generated captions by analysing their semantic content. Extensive
evaluations across a range of models and datasets indicate that
our metric, dubbed SPICE, shows high correlation with human
judgements.
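At its core, SPICE compares the semantic propositions expressed by a candidate caption with those expressed by the reference captions. The sketch below illustrates this scoring step on pre-parsed (subject, relation/attribute, object) tuples; the exact-match comparison, the toy tuples and the function name are illustrative assumptions, since the full metric parses captions into scene graphs and also matches tuples via WordNet synonyms.

```python
# Minimal sketch of a SPICE-style F-score over semantic tuples, assuming the
# scene-graph parsing has already been done. Exact matching only; the real
# metric additionally matches tuples through WordNet synonym sets.

def tuple_f1(candidate_tuples, reference_tuples):
    """F-score between the candidate's and the references' semantic tuples."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 0.0 if matched == 0 else 2 * precision * recall / (precision + recall)

# Hypothetical parses of a candidate and a (pooled) reference caption.
candidate = [("girl", "ride", "horse"), ("horse", "attr", "brown")]
reference = [("girl", "ride", "horse"), ("girl", "attr", "young"), ("horse", "attr", "brown")]
print(tuple_f1(candidate, reference))  # ≈ 0.8
```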
Armed with a more effective evaluation metric, we address the
challenge of image captioning. Visual attention mechanisms have
been widely adopted in image captioning and visual question
answering (VQA) architectures to facilitate fine-grained visual
processing. We extend existing approaches by proposing a
bottom-up and top-down attention mechanism that enables attention
to be focused at the level of objects and other salient image
regions, which is the natural basis for attention to be
considered. Applying this approach to image captioning, we achieve
state-of-the-art results on the COCO test server. Applying the same
approach to VQA, we obtain first place in the 2017 VQA Challenge,
demonstrating the broad applicability of the method.
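The sketch below shows the general shape of such an attention step: a set of bottom-up region features, for example proposed by an object detector, is weighted by a top-down query such as the caption decoder's hidden state, and the weighted sum becomes the visual input for the next prediction. The additive scoring function, dimensions and parameter names are illustrative assumptions rather than the exact architecture from the thesis.

```python
# Sketch of top-down soft attention over bottom-up region features.
import numpy as np

def attend(regions, query, W_v, W_h, w_a):
    """Return softmax attention weights over regions and the attended feature."""
    # regions: (K, D_v) region features; query: (D_h,) decoder state.
    scores = np.tanh(regions @ W_v + query @ W_h) @ w_a   # (K,) additive attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over the K regions
    return weights, weights @ regions                      # (K,), (D_v,) weighted sum

rng = np.random.default_rng(0)
K, D_v, D_h, D_a = 36, 2048, 512, 512                      # assumed sizes
regions = rng.standard_normal((K, D_v))
query = rng.standard_normal(D_h)
W_v, W_h, w_a = (rng.standard_normal(s) * 0.01 for s in [(D_v, D_a), (D_h, D_a), (D_a,)])
weights, attended = attend(regions, query, W_v, W_h, w_a)
print(weights.shape, attended.shape)                       # (36,) (2048,)
```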
Despite these advances, recurrent neural network (RNN) image
captioning models typically do not generalise well to
out-of-domain images containing novel scenes or objects. This
limitation severely hinders the use of these models in real
applications. To address this problem, we propose constrained
beam search, an approximate search algorithm that enforces
constraints over RNN output sequences. Using this approach, we
show that existing RNN captioning architectures can take
advantage of side information such as object detector outputs and
ground-truth image annotations at test time, without retraining.
Our results significantly outperform previous approaches that
incorporate the same information into the learning algorithm,
achieving state-of-the-art results for out-of-domain captioning
on COCO.
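The sketch below illustrates the constrained decoding idea for the simplest kind of constraint, namely that the output must contain every word in a required set. The thesis expresses constraints more generally as finite-state machines; here a state is just the subset of required words generated so far, and a separate beam is kept for each state. `step_fn` is a hypothetical callable that returns log-probabilities over the vocabulary for a given prefix.

```python
# Sketch of constrained beam search with one beam per constraint state.
import heapq

def constrained_beam_search(step_fn, required, eos, beam_size=3, max_len=20):
    required = frozenset(required)
    beams = {frozenset(): [(0.0, [])]}             # state -> list of (log-prob, prefix)
    for _ in range(max_len):
        new_beams = {}
        for state, hyps in beams.items():
            for logp, prefix in hyps:
                if prefix and prefix[-1] == eos:   # finished hypotheses are carried over
                    new_beams.setdefault(state, []).append((logp, prefix))
                    continue
                for tok, lp in step_fn(prefix).items():     # token -> log-probability
                    nstate = state | ({tok} & required)     # constraints met so far
                    new_beams.setdefault(nstate, []).append((logp + lp, prefix + [tok]))
        # Prune each state's beam independently.
        beams = {s: heapq.nlargest(beam_size, h) for s, h in new_beams.items()}
    complete = beams.get(required, [])             # hypotheses that satisfy every constraint
    return max(complete) if complete else None
```

Keeping a separate beam per constraint state is the key design choice: hypotheses that satisfy the constraints usually have lower likelihood than unconstrained ones, and would be pruned away immediately if all hypotheses competed in a single beam.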
Last, to enable and encourage the application of vision and
language methods to problems involving embodied agents, we
present the Matterport3D Simulator, a large-scale interactive
reinforcement learning environment constructed from
densely-sampled panoramic RGB-D images of 90 real buildings.
Using this simulator, which can in future support a range of
embodied vision and language tasks, we collect the first
benchmark dataset for visually-grounded natural language
navigation in real buildings. We investigate the difficulty of
this task, and particularly the difficulty of operating in unseen
environments, using several baselines and a sequence-to-sequence
model based on methods successfully applied to other vision and
language tasks.
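As a rough sketch of what such a sequence-to-sequence agent can look like, the code below encodes the instruction with an LSTM and, at each time step, predicts one of a small set of discrete actions from the current visual observation and the pooled instruction context. The sizes, the action space and the mean pooling are illustrative assumptions, not the exact model evaluated in the thesis.

```python
# Sketch of an instruction-to-action sequence-to-sequence navigation agent.
import torch
import torch.nn as nn

class Seq2SeqAgent(nn.Module):
    def __init__(self, vocab_size, n_actions=6, word_dim=256, img_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.encoder = nn.LSTM(word_dim, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(img_dim + hidden, hidden)
        self.action_head = nn.Linear(hidden, n_actions)

    def forward(self, instruction, image_feats):
        # instruction: (B, T) word ids; image_feats: (B, steps, img_dim) observations.
        enc_out, _ = self.encoder(self.embed(instruction))
        ctx = enc_out.mean(dim=1)                           # pooled instruction context
        h = torch.zeros_like(ctx)
        c = torch.zeros_like(ctx)
        logits = []
        for t in range(image_feats.size(1)):
            h, c = self.decoder(torch.cat([image_feats[:, t], ctx], dim=-1), (h, c))
            logits.append(self.action_head(h))              # scores over discrete actions
        return torch.stack(logits, dim=1)                   # (B, steps, n_actions)

agent = Seq2SeqAgent(vocab_size=1000)
out = agent(torch.randint(0, 1000, (2, 12)), torch.randn(2, 5, 2048))
print(out.shape)  # torch.Size([2, 5, 6])
```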
Keywords
image caption generation, image captioning, automatic image description, visual question answering, VQA, COCO, COCO dataset, vision and language, language and vision, vision and language navigation, VLN, SPICE, SPICE metric, image caption evaluation, image caption evaluation metric, bottom up and top down attention, visual attention, image attention, Matterport, Matterport3D, Matterport3D Simulator, constrained beam search, embodied agents, vision and language agents
Type
Thesis (PhD)