Ma, Jiahao
2023-05-18
2023-05-18
http://hdl.handle.net/1885/291536

Occlusion is a fundamental challenge in many computer vision tasks. In particular, monocular detection systems struggle to detect individuals concealed behind obstacles. Multi-view systems, which use multiple cameras with overlapping fields of view, can locate occluded objects in crowded scenes more effectively. Existing methods in this field generally follow an "object modeling - aggregation" strategy. In this thesis, we propose two approaches to effective object modeling for multi-view general object detection in large-scale, crowded scenes. The proposed methods achieve competitive performance in multi-view 3D object detection (e.g., cattle and robots) and in multi-view pedestrian localization. Additionally, we introduce a large-scale synthetic dataset, MultiviewC, to promote the development of multi-view 3D object detection.

In the first approach, we propose VFA (Voxelized 3D Feature Aggregation) to overcome a limitation of feature-transformation-based methods, which neglect the vertical extent of objects. We voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with the projected voxels. This allows us to identify and aggregate 2D features that lie along the same vertical line, significantly reducing projection distortions.

In the second approach, we introduce a novel pedestrian representation scheme based on human point cloud modeling. Using ray tracing for holistic human depth estimation, we model each pedestrian as a thin, upright point cloud standing on the ground. We then aggregate these cardboard-like point clouds to determine the pedestrians' positions. Compared with existing methods, this approach explicitly leverages pedestrian appearance and significantly reduces projection errors by estimating pedestrians' heights.
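The voxel projection and aggregation step described for VFA can be illustrated with a minimal sketch. This is not the thesis implementation: the pinhole projection with 3x4 matrices, the scalar per-pixel features, the averaging across views, the max-pooling along the vertical axis, and every name used here (`project`, `voxel_centers`, `aggregate`, `birds_eye_view`) are assumptions chosen for illustration.

```python
import math
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Index3 = Tuple[int, int, int]
Matrix3x4 = Sequence[Sequence[float]]
FeatureMap = List[List[float]]  # assumed: one scalar feature per pixel


def project(P: Matrix3x4, point: Vec3) -> Optional[Tuple[float, float]]:
    """Project a world point with a 3x4 pinhole matrix; None if behind the camera."""
    x, y, z = point
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w <= 1e-9:
        return None
    return u / w, v / w


def voxel_centers(nx: int, ny: int, nz: int, size: float) -> Iterator[Tuple[Index3, Vec3]]:
    """Yield (index, center) for a regular grid with one corner at the world origin."""
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                yield (i, j, k), ((i + 0.5) * size, (j + 0.5) * size, (k + 0.5) * size)


def aggregate(feature_maps: Sequence[FeatureMap], cameras: Sequence[Matrix3x4],
              nx: int, ny: int, nz: int, size: float) -> Dict[Index3, float]:
    """Average, per voxel, the 2D features found at its projection in each view."""
    volume: Dict[Index3, float] = {}
    for idx, center in voxel_centers(nx, ny, nz, size):
        samples = []
        for fmap, P in zip(feature_maps, cameras):
            uv = project(P, center)
            if uv is None:
                continue
            col, row = math.floor(uv[0]), math.floor(uv[1])
            # Skip views in which the voxel projects outside the image.
            if 0 <= row < len(fmap) and 0 <= col < len(fmap[0]):
                samples.append(fmap[row][col])
        volume[idx] = sum(samples) / len(samples) if samples else 0.0
    return volume


def birds_eye_view(volume: Dict[Index3, float], nx: int, ny: int, nz: int) -> Dict[Tuple[int, int], float]:
    """Pool each vertical column of voxels (fixed i, j) into one ground-plane cell.

    Max-pooling is an assumed choice, not one stated in the abstract.
    """
    return {(i, j): max(volume[(i, j, k)] for k in range(nz))
            for i in range(nx) for j in range(ny)}
```

Voxels that share the same `(i, j)` index form one vertical column, so features belonging to the same upright object are gathered together before pooling instead of being smeared across the ground plane.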
Existing multi-view detection datasets contain limited data and lack object orientation labels, hindering the development of multi-view 3D object detection. To address this issue, this dissertation presents a new large-scale, high-resolution dataset called MultiviewC. The dataset provides rich and accurate labels that benefit 3D object detection, multi-view detection, and action recognition.

In summary, the goal of this thesis is to enable vision systems to detect occluded objects effectively in crowded scenes. We propose several novel ideas for constructing feature representations that push the boundaries of current state-of-the-art solutions for multi-view object detection.

en-AU
Multiview pedestrian detection in crowded scene
2023
10.25911/V40R-NG56