2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00874

Holistic++ Scene Understanding: Single-View 3D Holistic Scene Parsing and Human Pose Estimation With Human-Object Interaction and Physical Commonsense

Abstract: We propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction, i.e., 3D estimation of object bounding boxes, camera pose, and room layout, and (ii) 3D human pose estimation. The intuition behind this is to leverage the coupled nature of these two tasks to improve the granularity and performance of scene understanding. We propose to exploit two critical and essential connections between these two tasks: (i) human-object …
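The abstract enumerates the quantities a holistic++ parse recovers from a single image. As a rough illustration only, here is a minimal data-structure sketch of such a joint output; the class and field names are assumptions made for exposition, not the authors' interface:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class ObjectBox3D:
    """One scene element as an oriented 3D bounding box (hypothetical fields)."""
    label: str
    center: np.ndarray   # (3,) box center in world coordinates
    size: np.ndarray     # (3,) width, height, depth
    yaw: float           # rotation about the vertical axis, in radians

@dataclass
class HolisticParse:
    """Joint single-view estimate: scene parse plus 3D human pose."""
    camera_pitch: float                     # radians
    camera_roll: float                      # radians
    room_layout: ObjectBox3D                # room layout modeled as one 3D box
    objects: List[ObjectBox3D] = field(default_factory=list)
    human_joints_3d: Optional[np.ndarray] = None  # (J, 3) joint positions
```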


Cited by 107 publications (65 citation statements)
References 58 publications
“…Computer vision (CV)-based human motion modelling and analysis has been extensively researched by the community. But, most of the research can be categorised into pose estimation [160], human-object interaction [63,98], activity/gesture recognition [31,65,113] or human-human interaction [53]. However, comparative analysis of human motion has received relatively less attention from the community.…”
Section: Introduction
confidence: 99%
“…Early works only focus on room layout estimation [12,21,25,5,35] to represent rooms with a bounding box. With the advance of CNNs, more methods are developed to estimate object poses beyond the layout [7,14,1]. Still, these methods are limited to the prediction of the 3D bounding box of each furniture.…”
Section: Related Work
confidence: 99%
“…Index  Inputs    …                           Output shape
(1)     Input     Object images in a scene    N x 3 x 256 x 256
(2)     Input     Geometry features [13,48]   N x N x 64
(3)     (1)       ResNet-34 [11]              N x 2048
(4)     (2), (3)  Relation Module [13]        N x 2048
(5)     (3), (4)  Element-wise sum            N x 2048
(6)     (5)       (2) …
Table 7: Architecture of Layout Estimation Network. LEN takes the full scene image as input and produces the camera pitch β and roll γ angles, the 3D layout center C, size s and orientation θ in the world system.…”
Section: Index Inputs
confidence: 99%
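The quoted table and caption describe image features feeding a layout-estimation network that regresses camera pitch/roll and a 3D layout box. The following is a minimal sketch in that spirit, assuming a standard torchvision ResNet-34 encoder and simple MLP heads; the hidden sizes, output packing, and the omission of the quoted Relation Module are assumptions, and this is not the cited paper's implementation:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LayoutEstimationNet(nn.Module):
    """Illustrative layout-estimation head: a ResNet-34 image encoder followed
    by small MLP heads regressing camera pitch/roll and a 3D layout box
    (center, size, orientation)."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        backbone = models.resnet34(weights=None)
        # Drop the classification layer; keep the globally pooled features.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        enc_dim = 512  # final feature width of ResNet-34 (illustrative choice)
        self.camera_head = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),          # pitch beta, roll gamma
        )
        self.layout_head = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 3 + 1),  # center C, size s, orientation theta
        )

    def forward(self, image: torch.Tensor):
        f = self.encoder(image).flatten(1)   # (B, 512) pooled image features
        pitch_roll = self.camera_head(f)
        layout = self.layout_head(f)
        return {
            "pitch": pitch_roll[:, 0],
            "roll": pitch_roll[:, 1],
            "center": layout[:, 0:3],
            "size": layout[:, 3:6],
            "orientation": layout[:, 6],
        }
```

A dummy forward pass such as `LayoutEstimationNet()(torch.zeros(1, 3, 256, 256))` returns the camera and layout parameters as a dictionary of tensors.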
“…Monszpart et al [26] recover both a plausible scene arrangement and human motions to fit an input monocular video by jointly reasoning about scene objects and human motions over space-time. Chen et al [4] jointly learn scene parsing, object bounding-boxes, camera pose, room layout and 3D human pose estimation given a single-view image. Li et al [21] learn a 3D pose generative model to automatically put 3D body skeletons into the input scene represented by RGB, RGB-D, or depth image.…”
Section: Related Work
confidence: 99%