Deep Multiview Understanding: Design and Optimisation of Multiple Camera Systems
Abstract
Incorporating multiple camera views (multiview) helps computer vision systems resolve ambiguities, mitigate occlusions, and extend field-of-view coverage. Multiview setups have demonstrated significant benefits in classification, detection, and tracking. In this thesis, we investigate how to improve multiview understanding using deep learning and how to design better multiview camera layouts. The initial focus is on enhancing algorithm performance for given multiview setups: we study multiview detection and tracking tasks and propose novel methods that establish new paradigms and achieve state-of-the-art results. Building upon these achievements, the thesis further explores the design of multiview camera layouts for improved efficiency.

The thesis commences by introducing novel deep-learning algorithms for multiview detection, which address occlusions in crowded scenes. In Chapter 2, we propose MVDet, a fully convolutional multiview detector that projects feature maps from each camera view onto the bird's-eye-view ground plane for multiview aggregation. The fully convolutional design enables end-to-end training, and MVDet outperforms previous methods. However, MVDet struggles with projection distortions that differ across target positions and cameras. To address this issue, Chapter 3 presents MVDeTr, which aggregates multiview information with a shadow transformer. Unlike convolutions, the shadow transformer adapts to the varied shadow-like distortions by attending differently at different positions and in different cameras. Additionally, a view-coherent data augmentation strategy is introduced to maintain multiview consistency during training.

The thesis then shifts focus to tracking problems, where multiple camera views are utilised to expand the field of view.
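The ground-plane aggregation described for Chapter 2 can be illustrated with a minimal sketch. This is not the MVDet implementation: the per-camera homographies, the nearest-neighbour sampling, and the sum-based fusion below are simplifying assumptions for illustration only.

```python
import numpy as np

def warp_to_ground_plane(feat, H, out_shape):
    """Warp one view's feature map onto a ground-plane grid.

    feat:      (C, h, w) feature map in image coordinates.
    H:         3x3 homography mapping ground-plane (x, y, 1) to image (u, v, 1).
    out_shape: (Hg, Wg) size of the ground-plane grid.
    """
    C, h, w = feat.shape
    Hg, Wg = out_shape
    # Homogeneous ground-plane coordinates for every output cell.
    ys, xs = np.mgrid[0:Hg, 0:Wg]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(Hg * Wg)])  # (3, Hg*Wg)
    # Project each ground-plane cell into the image and dehomogenise.
    uvw = H @ pts
    u = uvw[0] / uvw[2]
    v = uvw[1] / uvw[2]
    # Nearest-neighbour sampling; cells outside the camera view stay zero.
    out = np.zeros((C, Hg, Wg))
    ui, vi = np.round(u).astype(int), np.round(v).astype(int)
    valid = (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)
    out[:, ys.ravel()[valid], xs.ravel()[valid]] = feat[:, vi[valid], ui[valid]]
    return out

def aggregate(feats, homographies, out_shape):
    """Fuse several views by summing their warped feature maps."""
    return sum(warp_to_ground_plane(f, H, out_shape)
               for f, H in zip(feats, homographies))
```

In the actual system a convolutional (or, in MVDeTr, transformer-based) network operates on the aggregated ground-plane representation; the sketch only shows the geometric projection step.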
In Chapter 4, we investigate multi-target multi-camera tracking (MTMCT) and propose a simple yet effective approach that adapts affinity estimation to the corresponding matching scopes in MTMCT. Rather than addressing all possible appearance changes, the proposed adaptive affinity specialises in the appearance changes that may actually occur during tracking. To achieve this, we introduce a new data sampling scheme based on the temporal windows originally designed to limit the matching scopes in tracking, yielding significant improvements over previous appearance metrics and competitive accuracy.

Lastly, the thesis explores the design of multiview camera layouts to reduce the computational cost of processing multiple views. In Chapter 5, we propose MVSelect, a reinforcement-learning-based view selection module that analyses the target object or scenario from the given views and selects the next best view for processing. Experimental results on multiview classification and detection tasks demonstrate that our approach achieves promising accuracy while using only two or three of the N available views, thereby significantly reducing computational costs. Furthermore, analysis of the selected views reveals that certain cameras can be deactivated with minimal impact on accuracy, shedding light on future camera layout optimisation for multiview systems.

In summary, this thesis investigates multiview understanding by designing deep-learning algorithms for given multiview setups and by optimising camera layouts to enhance efficiency. The findings presented herein lay the groundwork for future research on jointly optimising camera layouts and deep-learning architectures for multiview setups, with potential applications in various computer vision tasks.
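The temporal-window sampling idea from Chapter 4 can be sketched as follows. This is a hypothetical illustration, not the thesis's actual scheme: the pair-based formulation, the uniform random sampling, and the single fixed window length are all assumptions made here for simplicity.

```python
import random

def sample_pairs_within_window(detections, window, n_pairs, seed=0):
    """Sample affinity-training pairs whose time gap fits the matching window.

    detections: list of (track_id, frame) tuples.
    window:     maximum frame gap the tracker ever compares across, so pairs
                further apart are never matched and need not be trained on.
    Returns (i, j, label) triples; label is 1 for same identity, else 0.
    """
    rng = random.Random(seed)
    pairs, n = [], len(detections)
    while len(pairs) < n_pairs:
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue
        (id_i, t_i), (id_j, t_j) = detections[i], detections[j]
        if abs(t_i - t_j) > window:  # outside the matching scope: skip
            continue
        pairs.append((i, j, int(id_i == id_j)))
    return pairs
```

The point of restricting sampling this way is that the learned appearance metric only has to discriminate the appearance changes it will actually encounter within the tracker's matching scope, rather than arbitrary long-range changes.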