
2D+3D Indoor Scene Understanding from a Single Monocular Image

Zhuo, Wei


dc.contributor.authorZhuo, Wei
dc.date.accessioned2018-06-28T04:07:13Z
dc.date.available2018-06-28T04:07:13Z
dc.identifier.otherb53507010
dc.identifier.urihttp://hdl.handle.net/1885/144616
dc.description.abstractScene understanding, as a broad field encompassing many subtopics, has gained great interest in recent years. Among these subtopics, indoor scene understanding, which has its own specific attributes and challenges compared to outdoor scene understanding, has drawn a lot of attention. It has potential applications in a wide variety of domains, such as robotic navigation, object grasping for personal robotics, and augmented reality. To our knowledge, existing research on indoor scenes typically makes use of depth sensors, such as the Kinect, which are, however, not always available. In this thesis, we focus on addressing indoor scene understanding tasks in the general case where only a monocular color image of the scene is available. Specifically, we first study the problem of estimating a detailed depth map from a monocular image. Then, benefiting from deep-learning-based depth estimation, we tackle the higher-level tasks of 3D box proposal generation and of scene parsing with instance segmentation, semantic labeling, and support relationship inference from a monocular image. Our research on indoor scene understanding thus provides a comprehensive scene interpretation from various perspectives and at various scales. For monocular depth estimation, previous approaches are limited in that they reason about depth only locally, on a single scale, and do not exploit the important information carried by geometric scene structures. Here, we developed a novel graphical model that reasons about detailed depth while leveraging geometric scene structures at multiple scales. For 3D box proposals, to the best of our knowledge, our approach constitutes the first attempt to reason about class-independent 3D box proposals from a single monocular image. To this end, we developed a novel, integrated, differentiable framework that estimates depth, extracts a volumetric scene representation, and generates 3D proposals. At the core of this framework lies a novel residual, differentiable truncated signed distance function module, which is able to handle the relatively low accuracy of the predicted depth map. For scene parsing, we tackled its three subtasks of instance segmentation, semantic labeling, and support relationship inference on instances. Existing work typically reasons about these subtasks independently. Here, we leverage the fact that they bear strong connections, which can facilitate addressing them if modeled properly. To this end, we developed an integrated graphical model that reasons about the mutual relationships among the above subtasks. In summary, this thesis introduces novel and effective methodologies for each of three indoor scene understanding tasks, i.e., depth estimation, 3D box proposal generation, and scene parsing, and exploits the dependence of the latter two tasks on depth estimates. Evaluation on several benchmark datasets demonstrates the effectiveness of our algorithms and the benefits of utilizing depth estimates for higher-level tasks.
dc.language.isoen
dc.subjectScene Understanding
dc.subjectMonocular Image Processing
dc.subjectDepth Estimation
dc.subject3D Box Proposal
dc.subjectSemantic Labeling
dc.subjectInstance Segmentation
dc.subjectSupport Relationship Inference
dc.title2D+3D Indoor Scene Understanding from a Single Monocular Image
dc.typeThesis (PhD)
local.contributor.supervisorSalzmann, Mathieu
dcterms.valid2018
local.description.notesthe author deposited 28/06/2018
local.type.degreeDoctor of Philosophy (PhD)
dc.date.issued2018
local.contributor.affiliationCollege of Engineering and Computer Science, The Australian National University
local.identifier.doi10.25911/5d67b4a8aaee7
local.mintdoimint
CollectionsOpen Access Theses

Download

File: Zhuo Thesis 2018.pdf — 16.99 MB — Adobe PDF


Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.

Updated: 17 November 2022 / Responsible Officer: University Librarian / Page Contact: Library Systems & Web Coordinator