Determining Visual Motion in the Deep Learning Era




Jiang, Shihao




Determining visual motion, or optical flow, is a fundamental problem in computer vision and has attracted sustained research interest over the past few decades. Beyond its academic interest, progress in optical flow research has applications in many fields, including video processing, graphics, robotics and medicine. Traditionally, optical flow estimation has been formulated as an optimisation problem, often solved by minimising an energy function. The energy function is designed around the brightness constancy assumption, which often fails in real-world scenarios due to lighting changes, shadows and occlusions, causing traditional algorithms to fail. Another weakness of traditional optimisation approaches is their slow runtime: solving for the optical flow usually relies on iterative methods, which can take anywhere from a few seconds to a minute. This is problematic in real-world applications. The recent surge of deep learning techniques has enabled optical flow estimation to be formulated as a learning problem. Recent papers have shown significant accuracy improvements over traditional approaches, as well as significantly faster runtimes. Despite this progress, there remain challenging cases where current learning approaches fail, such as occlusions, featureless regions (the aperture problem), and large motions of small objects. Current methods are also limited by their large GPU memory consumption. They often employ an intermediate representation called a cost volume, which scales quadratically with the number of pixels. This 4D representation acts as a memory bottleneck for modern optical flow approaches and prevents them from scaling up to high-resolution images. In this PhD thesis, we show that long-range modelling and sparse representations are important cornerstones of modern optical flow estimation.
We first show that regularising flow prediction with an estimated essential matrix can improve flow prediction performance in mostly rigid scenes, particularly in challenging cases such as featureless regions and motion blur. We then demonstrate that a sparse cost volume can be just as effective as a dense cost volume, with significantly less memory consumption. This opens the way for future optical flow research at higher image resolutions. Finally, we show that incorporating a self-attention module to globally aggregate motion features helps improve state-of-the-art flow prediction. Modelling long-range connections is particularly helpful for dealing with occlusions.
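To make the memory argument concrete, the following is a minimal sketch of how a dense all-pairs cost volume grows with image size, and how keeping only a few matches per pixel changes that growth. It is an illustration under stated assumptions, not the thesis's implementation: the function names are hypothetical, the features are plain Python lists, and the top-k selection by dot-product correlation is assumed here as one possible way to sparsify a cost volume.

```python
import heapq


def dense_cost_volume_entries(height, width):
    """Entries in an all-pairs (4D) cost volume: one score for every
    pixel in frame 1 against every pixel in frame 2, i.e. (H*W)^2."""
    n = height * width
    return n * n


def sparse_cost_volume_entries(height, width, k):
    """Entries if only k matches are stored per frame-1 pixel: H*W*k.
    (Assumption for illustration: a fixed k for every pixel.)"""
    return height * width * k


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def topk_matches(feats1, feats2, k):
    """For each feature in feats1, keep the k highest dot-product
    correlations with feats2.

    Returns one list per feats1 entry, containing (index_in_feats2, score)
    pairs sorted by descending score.
    """
    result = []
    for f1 in feats1:
        scores = [(dot(f1, f2), j) for j, f2 in enumerate(feats2)]
        best = heapq.nlargest(k, scores)
        result.append([(j, s) for s, j in best])
    return result


if __name__ == "__main__":
    # Hypothetical 60 x 80 feature map.
    print(dense_cost_volume_entries(60, 80))      # quadratic in pixel count
    print(sparse_cost_volume_entries(60, 80, 8))  # linear in pixel count

    # Toy features: two pixels in frame 1, three candidates in frame 2.
    feats1 = [[1.0, 0.0], [0.0, 1.0]]
    feats2 = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
    print(topk_matches(feats1, feats2, 1))
```

In this sketch, doubling both image dimensions multiplies the dense entry count by 16 but the sparse entry count by only 4, which is the scaling difference the abstract refers to. Note that the toy `topk_matches` still scores every pair before discarding most of them, so it illustrates the storage saving only, not an efficient way to find the matches.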






Thesis (PhD)
