Self-supervised Visual Geometry Learning

Authors

Zhong, Yiran

Abstract

Visual geometry learning aims to recover 3D geometric information, such as surface normals, depth maps, and camera poses, from images. As a classic task in computer vision, this problem has been studied extensively for decades. It encompasses depth completion, stereo matching, monocular depth estimation, optical flow, visual odometry, and structure from motion, among others. This thesis is dedicated to solving these problems from both conventional learning and deep learning perspectives. Like most data-driven methods, supervised deep learning-based methods require a large amount of labeled training data and suffer from limited generalization ability. Self-supervised learning is a technique that allows a network to learn feature representations without labeled data. In this thesis, we investigate the problem of applying self-supervised learning techniques to visual geometry learning and push the limit of the state of the art in terms of accuracy, speed, and generalization ability in visual geometry recovery tasks.

In the depth completion task, two conventional optimization-based methods are proposed. The first assumes that a dense depth map can be approximated by a weighted sum of a set of principal components and enforces this assumption as a global geometric constraint. A colour-guided auto-regression model is applied so that the estimated depth map has sharp object boundaries. The proposed method can be solved efficiently in closed form and outperforms previous methods. The second method further imposes a piecewise planar model on the depth completion task and formulates it as a continuous Conditional Random Field (CRF) optimization problem. Experiments show that the proposed method is faster and more accurate than previous methods.

In the stereo matching task, we propose to solve this problem through a deep self-supervised framework.
Conventional optimization-based methods often require several seconds to minutes to process a sample, which makes them infeasible for time-critical applications such as autonomous driving and robotics. Moreover, supervised deep methods often require a large number of ground truth labels for training and suffer from limited generalization capability. By leveraging self-supervised learning, our stereo matching networks do not need any labeled data and can adapt themselves to new scenarios on the fly. The key idea is to make several assumptions about scenes, formulate them as loss functions, and optimize them through backpropagation. The loss functions are similar to the energy functions in conventional optimization-based methods, but we are free to use more complex loss functions to describe a scene more precisely. Experiments demonstrate that the proposed methods perform better in terms of both speed and accuracy. A similar strategy is also applied to the LiDAR-Stereo fusion task, where a “feedback loop” is proposed to deal with the noise in LiDAR measurements. We also extend stereo matching to the stereo video matching problem by utilizing convolutional LSTM modules to handle temporal consistency in videos. To deal with time-critical applications, we present a highly efficient stereo matching network structure that can process HD images at 100 FPS. We also leverage AutoML techniques, namely neural architecture search (NAS), to find an optimal architecture for deep stereo matching and achieve top-ranked accuracy on various benchmarks with far fewer trainable parameters.

We further define a new problem called single mixture image depth estimation. Here, the single image can be a mixture of a stereo pair in the form I = αI_left + (1 − α)I_right. Depending on the choice of α, this task can be seen as red-cyan depth, double-vision depth, or monocular depth estimation.
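The mixture image model above can be illustrated with a minimal Python sketch. The abstract does not spell out how α maps to each of the three cases, so the per-channel weights used here (one α per colour channel, with a red-cyan mixture taking red from the left view and green and blue from the right) are an assumption made for illustration, and `mix_images` is a hypothetical helper name rather than anything from the thesis.

```python
def mix_images(left, right, alpha):
    """Blend I = alpha * I_left + (1 - alpha) * I_right, channel by channel.

    left, right: images as nested lists indexed [row][col][channel].
    alpha: one weight per channel, each in [0, 1] (an assumed convention).
    """
    return [
        [
            [a * l + (1.0 - a) * r for a, l, r in zip(alpha, lpix, rpix)]
            for lpix, rpix in zip(lrow, rrow)
        ]
        for lrow, rrow in zip(left, right)
    ]


# Tiny 1x2 RGB "images" standing in for a stereo pair.
left = [[[1.0, 0.5, 0.0], [0.2, 0.2, 0.2]]]
right = [[[0.0, 0.5, 1.0], [0.8, 0.8, 0.8]]]

# Assumed settings for the three cases named in the abstract.
double_vision = mix_images(left, right, (0.5, 0.5, 0.5))  # equal blend
red_cyan = mix_images(left, right, (1.0, 0.0, 0.0))       # red from left only
monocular = mix_images(left, right, (1.0, 1.0, 1.0))      # left view only
```

With all weights set to 1 the mixture reduces to the left image, which is why monocular depth estimation can be read as a special case of the same formulation.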
Instead of regressing depth from a single image by brute force, we divide the task into two sub-tasks: image separation and stereo matching. We first decouple the mixed image through an image separation module and then perform stereo matching on the separated pairs. The whole system needs only the original stereo pairs as supervision and performs better than previous methods.

In the optical flow task, we investigate multiple ways to enforce the global epipolar constraint in self-supervised optical flow estimation. For stationary scenes, a fundamental matrix constraint is present. We first estimate a fundamental matrix from matching points and regularize optical flow with the Sampson distance. For dynamic scenarios, we propose a low-rank constraint and a union-of-subspaces constraint. They avoid explicitly computing the fundamental matrix as well as multi-motion estimation. Experiments on various benchmarks demonstrate the effectiveness of our method.

In the structure from motion (SfM) task, we revisit existing deep learning-based approaches and find that they all formulate the problem in ways that are fundamentally ill-posed, relying on training data to overcome the inherent difficulties. In contrast, we propose a new deep framework that leverages the well-posedness of the classic SfM pipeline. We also propose a scale-invariant matching module to handle the scale ambiguities in the monocular SfM task. Our framework outperforms all state-of-the-art two-view SfM methods by a clear margin on various benchmarks in both relative pose estimation and depth estimation tasks.

Apart from these tasks, we also show how to apply geometric information to high-level computer vision tasks, namely RGB-D semantic segmentation. We propose a natural and direct 3D representation to encode RGB-D data and regularize it with a lightweight 3D convolutional network. State-of-the-art performance of our method on various datasets suggests that such a simple 3D representation is effective in incorporating 3D geometric information.
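As an illustration of the Sampson distance mentioned for the static-scene optical flow constraint, the following minimal Python sketch evaluates the standard first-order epipolar error for one pixel and its flow vector. It shows the textbook formula only; how the thesis turns this quantity into a training loss is not stated in the abstract, and the function name and the example fundamental matrix are assumptions chosen for illustration.

```python
def sampson_distance(F, x, flow):
    """Sampson (first-order) epipolar error for one pixel and its flow.

    F: 3x3 fundamental matrix as nested lists.
    x: pixel (u, v) in the first image.
    flow: displacement (du, dv), so the match is x + flow in the second image.
    """
    x1 = (x[0], x[1], 1.0)                      # homogeneous point, image 1
    x2 = (x[0] + flow[0], x[1] + flow[1], 1.0)  # homogeneous point, image 2
    Fx1 = [sum(F[i][j] * x1[j] for j in range(3)) for i in range(3)]
    Ftx2 = [sum(F[i][j] * x2[i] for i in range(3)) for j in range(3)]
    num = sum(x2[i] * Fx1[i] for i in range(3)) ** 2  # (x2^T F x1)^2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den  # undefined when den == 0 (degenerate configuration)


# Assumed example: F for a pure horizontal camera translation,
# under which matching points must stay on the same image row.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

on_line = sampson_distance(F, (10.0, 20.0), (5.0, 0.0))   # row preserved
off_line = sampson_distance(F, (10.0, 20.0), (5.0, 2.0))  # row shifted by 2
```

A flow vector that keeps the pixel on its epipolar line gives zero error, while a vertical deviation is penalized, which is the behaviour a regularizer of this kind relies on.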

Access Statement

Open Access
