Self-supervised Visual Geometry Learning
Abstract
Visual geometry learning aims to recover 3D geometric information, such as surface normals, depth maps, and camera poses, from images. As a classic task in computer vision, this problem has been studied extensively for decades. It encompasses depth completion, stereo matching, monocular depth estimation, optical flow, visual odometry, and structure from motion, among others. This thesis is dedicated to solving these problems from both conventional and deep learning perspectives. Like most data-driven methods, supervised deep learning-based methods require a large amount of labeled training data and suffer from limited generalization ability. Self-supervised learning is a technique that allows a network to learn feature representations without labeled data. In this thesis, we investigate how to apply self-supervised learning techniques to visual geometry learning and push the state of the art in accuracy, speed, and generalization ability in visual geometry recovery tasks.

In the depth completion task, two conventional optimization-based methods are proposed. The first assumes that a dense depth map can be approximated by a weighted sum of a set of principal components and enforces this assumption as a global geometric constraint. A colour-guided auto-regression model is applied so that the estimated depth map has sharp object boundaries. The proposed method can be solved efficiently in closed form and outperforms previous methods. The second method further imposes a piecewise planar model on the depth completion task and formulates it as a continuous Conditional Random Field (CRF) optimization problem. Experiments show that the proposed method is faster and more accurate than previous methods.

In the stereo matching task, we propose to solve this problem through a deep self-supervised framework. 
Conventional optimization-based methods often require several seconds to minutes to process a sample, which makes them infeasible for time-critical applications such as autonomous driving and robotics. Moreover, supervised deep methods often require a large number of ground truth labels for training and suffer from limited generalization capability. By leveraging self-supervised learning, our stereo matching networks do not need any labeled data and can adapt themselves to new scenarios on the fly. The key idea is to make several assumptions about the scene, formulate them as loss functions, and then optimize these losses through backpropagation. The loss functions are similar to the energy functions in conventional optimization-based methods, but we are free to use more complex loss functions that describe a scene more precisely. Experiments demonstrate that the proposed methods perform better in terms of both speed and accuracy. A similar strategy is also applied to the LiDAR-Stereo fusion task, where a "feedback loop" is proposed to deal with the noise in LiDAR measurements. We also extend stereo matching to the stereo video matching problem by using convolutional LSTM modules to handle temporal consistency in videos. To deal with time-critical applications, we present a highly efficient stereo matching network structure that can process HD images at 100 FPS. We also leverage AutoML techniques, namely neural architecture search (NAS), to find an optimal architecture for deep stereo matching and achieve top-1 accuracy on various benchmarks with far fewer trainable parameters.

We further define a new problem called single mixture image depth estimation. Here, the single image can be a mixture of a stereo pair of the form I = α I_left + (1 − α) I_right. Depending on the choice of α, this task can be seen as red-cyan depth, double-vision depth, or monocular depth estimation. 
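The mixture formula above can be illustrated with a minimal NumPy sketch. The function name `mix_stereo_pair`, the random test images, and the use of a per-channel α to express a red-cyan anaglyph are assumptions made here for illustration; the abstract only states the scalar form I = α I_left + (1 − α) I_right.

```python
import numpy as np


def mix_stereo_pair(left, right, alpha):
    """Blend a stereo pair as I = alpha * I_left + (1 - alpha) * I_right.

    left, right: float arrays of shape (H, W, 3).
    alpha: a scalar, or three per-channel weights. The per-channel form is
           an assumption used here so a red-cyan image fits the same formula.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    if left.shape != right.shape:
        raise ValueError("left and right must have the same shape")
    alpha = np.asarray(alpha, dtype=np.float64)  # broadcasts over H and W
    return alpha * left + (1.0 - alpha) * right


# Hypothetical inputs: random images standing in for a rectified stereo pair.
rng = np.random.default_rng(0)
left = rng.random((4, 6, 3))
right = rng.random((4, 6, 3))

double_vision = mix_stereo_pair(left, right, 0.5)         # equal blend
monocular = mix_stereo_pair(left, right, 1.0)             # left image only
red_cyan = mix_stereo_pair(left, right, [1.0, 0.0, 0.0])  # R from left, G/B from right
```

With α = 1 the mixture reduces to the left image alone, which is why monocular depth estimation appears as a special case of the same problem.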
Instead of regressing depth from a single image by brute force, we divide the task into two sub-tasks: image separation and stereo matching. We first decouple the mixed image through an image separation module and then perform stereo matching on the separated pair. The whole system needs only the original stereo pairs as supervision and performs better than previous methods.

In the optical flow task, we investigate multiple ways to enforce the global epipolar constraint in self-supervised optical flow estimation. For stationary scenes, a fundamental matrix constraint holds. We first estimate a fundamental matrix from matching points and regularize the optical flow with the Sampson distance. For dynamic scenes, we propose a low-rank constraint and a union-of-subspaces constraint. Both avoid explicitly computing the fundamental matrix and performing multi-motion estimation. Experiments on various benchmarks demonstrate the effectiveness of our method.

In the structure from motion (SfM) task, we revisit existing deep learning-based approaches and find that they all formulate the problem in ways that are fundamentally ill-posed, relying on training data to overcome the inherent difficulties. In contrast, we propose a new deep framework that leverages the well-posedness of the classic SfM pipeline. We also propose a scale-invariant matching module to handle the scale ambiguity in the monocular SfM task. Our framework outperforms all state-of-the-art two-view SfM methods by a clear margin on various benchmarks in both relative pose estimation and depth estimation.

Apart from these tasks, we also show how to apply geometric information to high-level computer vision tasks, namely RGB-D semantic segmentation. We propose a natural and direct 3D representation to encode RGB-D data and regularize it with a lightweight 3D convolutional network. 
State-of-the-art performance of our method on various datasets suggests that such a simple 3D representation is effective in incorporating 3D geometric information.
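
As an illustration of the stationary-scene epipolar constraint mentioned for the optical flow task, the following is a minimal NumPy sketch of the standard Sampson distance evaluated on flow correspondences. It assumes the fundamental matrix `F` is already given, whereas the thesis estimates it from matching points; the function name, argument layout, and the example `F` (a pure horizontal camera translation) are hypothetical choices for this sketch and may differ from the formulation used in the thesis.

```python
import numpy as np


def sampson_distance(F, points, flow, eps=1e-12):
    """Sampson distance of flow correspondences under a fundamental matrix.

    F:      (3, 3) fundamental matrix with x2^T F x1 = 0 for true matches.
    points: (N, 2) pixel coordinates in the first image.
    flow:   (N, 2) optical flow vectors, so matches are points + flow.
    Returns an (N,) array; zero means the epipolar constraint is satisfied.
    """
    points = np.asarray(points, dtype=np.float64)
    flow = np.asarray(flow, dtype=np.float64)
    ones = np.ones((points.shape[0], 1))
    x1 = np.hstack([points, ones])         # homogeneous points, image 1
    x2 = np.hstack([points + flow, ones])  # homogeneous points, image 2

    Fx1 = x1 @ F.T   # each row is F x1
    Ftx2 = x2 @ F    # each row is F^T x2
    numerator = np.sum(x2 * Fx1, axis=1) ** 2
    denominator = (Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2
                   + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2)
    return numerator / np.maximum(denominator, eps)


# Hypothetical example: F for a camera translating along the x-axis, so
# valid correspondences move only horizontally.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
points = np.array([[10.0, 20.0], [30.0, 40.0]])
horizontal_flow = np.array([[5.0, 0.0], [-3.0, 0.0]])
vertical_flow = np.array([[0.0, 1.0], [0.0, 1.0]])

consistent = sampson_distance(F, points, horizontal_flow)
violating = sampson_distance(F, points, vertical_flow)
```

In a self-supervised setting, the mean of this quantity over all pixels could serve as a regularization term on the predicted flow; how it is weighted against other losses is not specified in the abstract.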
Access Statement
Open Access