Determining Visual Motion in the Deep Learning Era
Abstract
Determining visual motion, or optical flow, is a fundamental problem in computer vision and
has attracted sustained research interest over the past few decades.
Beyond its academic interest, progress in optical flow research has applications
in many fields, including video processing, graphics, robotics and medicine.
Traditionally, optical flow estimation has been formulated as solving an optimisation problem,
often by minimising an energy function. The energy function is designed based on the brightness
constancy assumption, which often fails in real-world scenarios due to lighting changes,
shadows and occlusions, causing traditional algorithms to fail. Another weakness
of traditional optimisation approaches is their slow runtime: the iterative methods often
employed to solve for the optical flow can take from a few seconds to a minute,
which is problematic in real-world applications.
The recent surge of
deep learning techniques has enabled the formulation of optical flow estimation
as a learning problem. Recent papers have shown significant performance improvements over
traditional approaches, as well as significantly faster runtimes.
Despite the recent progress in learning approaches for optical flow,
challenging cases remain
where current approaches fail, such as occlusions, featureless
regions (the aperture problem), and large motions of small objects.
Current methods are also limited by their large GPU memory consumption.
An intermediate representation called the cost volume is often employed, and its size
scales quadratically with the number of pixels. This 4D representation acts as a memory
bottleneck for modern optical flow approaches and prevents them from scaling up to
high-resolution images.
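The quadratic scaling described above can be illustrated with a minimal sketch. It assumes the cost volume is an all-pairs correlation between two feature maps, which is one common construction and not necessarily the one used in the thesis. The function names, the NumPy implementation, and the float32 storage are illustrative assumptions.

```python
import numpy as np

def dense_cost_volume(feat1, feat2):
    """All-pairs correlation of two (H, W, C) feature maps -> (H, W, H, W)."""
    c = feat1.shape[-1]
    # Dot product of every feature vector in image 1 with every one in image 2.
    return np.einsum("ijc,klc->ijkl", feat1, feat2) / np.sqrt(c)

def cost_volume_bytes(h, w, bytes_per_entry=4):
    """Memory of a dense (H, W, H, W) volume, assuming float32 entries."""
    return (h * w) ** 2 * bytes_per_entry

rng = np.random.default_rng(0)
f1 = rng.standard_normal((8, 10, 16)).astype(np.float32)
f2 = rng.standard_normal((8, 10, 16)).astype(np.float32)

volume = dense_cost_volume(f1, f2)
print(volume.shape)  # (8, 10, 8, 10)

# Doubling both image sides gives 4x the pixels and 16x the memory.
print(cost_volume_bytes(16, 20) / cost_volume_bytes(8, 10))  # 16.0
```

Under these assumptions, each doubling of the image side length multiplies the cost volume memory by sixteen, which is why the representation becomes the bottleneck at high resolution.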
In this PhD thesis, we show that long-range modelling and sparse representations are
important cornerstones of modern optical flow estimation.
We first show that regularising flow prediction
with an estimated essential matrix can improve flow prediction performance in mostly rigid scenes,
particularly in challenging cases such as featureless regions and motion blur. We then demonstrate that
a sparse cost volume can be just as effective as a dense cost volume, with
significantly less memory consumption. This brings hope
for future optical flow research in which image resolutions are further increased.
Finally, we show that incorporating a self-attention module to globally aggregate motion features
helps improve state-of-the-art flow prediction.
Modelling long-range connections is particularly helpful
for dealing with occlusions.
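To make the memory saving of a sparse cost volume concrete, the following is a minimal sketch that keeps only the k highest-scoring matches per pixel. Top-k selection is an assumption made for illustration; the abstract does not say how the thesis builds its sparse volume. The function name, the `k` and `chunk` parameters, and the NumPy implementation are all hypothetical.

```python
import numpy as np

def topk_sparse_cost_volume(feat1, feat2, k, chunk=1024):
    """Keep only the k best matches per pixel (illustrative assumption).

    feat1, feat2: (H, W, C) feature maps. Returns (values, indices), each of
    shape (H*W, k); indices are flattened pixel positions in image 2.
    """
    h, w, c = feat1.shape
    a = feat1.reshape(h * w, c)
    b = feat2.reshape(h * w, c)
    values = np.empty((h * w, k), dtype=a.dtype)
    indices = np.empty((h * w, k), dtype=np.int64)
    # Process rows in chunks so the full (H*W, H*W) matrix is never stored.
    for start in range(0, h * w, chunk):
        scores = a[start:start + chunk] @ b.T / np.sqrt(c)
        top = np.argpartition(scores, -k, axis=1)[:, -k:]
        values[start:start + chunk] = np.take_along_axis(scores, top, axis=1)
        indices[start:start + chunk] = top
    return values, indices

rng = np.random.default_rng(0)
f1 = rng.standard_normal((6, 5, 8)).astype(np.float32)
f2 = rng.standard_normal((6, 5, 8)).astype(np.float32)
values, indices = topk_sparse_cost_volume(f1, f2, k=4)
print(values.shape, indices.shape)  # (30, 4) (30, 4)
```

In this sketch the stored volume grows as H·W·k and not (H·W)², so memory scales linearly with the number of pixels for a fixed k.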