Robust and Accurate Camera Localisation at a Large Scale

Liu, Liu

Robust and Accurate Camera Localisation at a Large Scale

Date

2021

Authors

Liu, Liu

Abstract

The task of camera-based localization aims to quickly and precisely pinpoint at which location (and viewing direction) the image was taken, against a pre-stored large-scale map of the environment. This technique can be used in many 3D computer vision applications, e.g., AR/VR and autonomous driving. Mapping the world is the first step to enable camera-based localization since a pre-stored map serves as a reference for a query image/sequence. In this thesis, we exploit three readily available sources: (i) satellite images; (ii) ground-view images; (iii) 3D points cloud. Based on the above three sources, we propose solutions to localize a query camera both effectively and efficiently, i.e., accurately localizing a query camera under a variety of lighting and viewing conditions within a small amount of time. The main contributions are summarized as follows. In chapter 3, we separately present a minimal 4-point and 2-point solver to estimate a relative and absolute camera pose. The core idea is exploiting the vertical direction from IMU or vanishing point to derive a closed-form solution of a quartic equation and a quadratic equation for the relative and absolute camera pose, respectively. In chapter 4, we localize a ground-view query image against a satellite map. Inspired by the insight that humans commonly use orientation information as an important cue for spatial localization, we propose a method that endows deep neural networks with the 'commonsense' of orientation. We design a Siamese network that explicitly encodes each pixel's orientation of the ground-view and satellite images. Our method boosts the learned deep features' discriminative power, outperforming all previous methods. In chapter 5, we localize a ground-view query image against a ground-view image database. We propose a representation learning method having higher location-discriminating power. The core idea is learning discriminative image embedding. Similarities among intra-place images (viewing the same landmarks) are maximized while similarities among inter-place images (viewing different landmarks) are minimized. The method is easy to implement and pluggable into any CNN. Experiments show that our method outperforms all previous methods. In chapter 6, we localize a ground-view query image against a large-scale 3D points cloud with visual descriptors. To address the ambiguities in direct 2D--3D feature matching, we introduce a global matching method that harnesses global contextual information exhibited both within the query image and among all the 3D points in the map. The core idea is to find the optimal 2D set to 3D set matching. Tests on standard benchmark datasets show the effectiveness of our method. In chapter 7, we localize a ground-view query image against a 3D points cloud with only coordinates. The problem is also known as blind Perspective-n-Point. We propose a deep CNN model that simultaneously solves for both the 6-DoF absolute camera pose and 2D--3D correspondences. The core idea is extracting point-wise 2D and 3D features from their coordinates and matching 2D and 3D features effectively in a global feature matching module. Extensive tests on both real and simulated data have shown that our method substantially outperforms existing approaches. Last, in chapter 8, we study the potential of using 3D lines. Specifically, we study the problem of aligning two partially overlapping 3D line reconstructions in Euclidean space. This technique can be used for localization with respect to a 3D line database when query 3D line reconstructions are available (e.g., from stereo triangulation). We propose a neural network, taking Pluecker representations of lines as input, and solving for line-to-line matches and estimate a 6-DoF rigid transformation. Experiments on indoor and outdoor datasets show that our method's registration (rotation and translation) precision outperforms baselines significantly.