Shallow2Deep: Indoor Scene Modeling by Single Image Understanding

Dense indoor scene modeling from 2D images is bottlenecked by the absence of depth information and by cluttered occlusions. We present an automatic indoor scene modeling approach using deep features from neural networks. Given a single RGB image, our method simultaneously recovers semantic content, 3D geometry, and object relationships by reasoning about the indoor context. Specifically, we design a shallow-to-deep architecture built on convolutional networks for semantic scene understanding and modeling. It uses multi-level convolutional networks to parse indoor semantics and geometry into non-relational and relational knowledge. Non-relational knowledge extracted by shallow networks (e.g. room layout, object geometry) is fed forward into deeper levels to parse relational semantics (e.g. support relationships). A Relation Network is proposed to infer the support relationship between objects. All the structured semantics and geometry above are assembled to guide a global optimization for 3D scene modeling. Qualitative and quantitative analysis demonstrates the feasibility of our method in understanding and modeling semantics-enriched indoor scenes, evaluated in terms of reconstruction accuracy, computational performance, and scene complexity.

Understanding indoor scenes from a single image involves a series of vision tasks [1], most of which are still under active development, e.g. object segmentation [2], layout estimation [3] and geometric reasoning [4]. Although machine intelligence has reached human-level performance in some tasks (e.g. scene recognition [5]), each of these techniques represents only a fragment of the full scene context.

Lacking depth cues, prior studies reconstructed indoor scenes from a single image by exploiting shallow image features (e.g. line segments and HOG descriptors [6,4]) or by introducing depth estimation [7,8] to search for object models. Other works adopt a Render-and-Match strategy to obtain CAD scenes whose renderings resemble the input images [9]. However, the problem remains unresolved when indoor geometry is heavily cluttered and complicated, for three reasons. First, complicated indoor scenes involve heavily occluded objects, which can cause missing content in detection [9]. Second, cluttered environments significantly increase the difficulty of camera and layout estimation, which critically affects reconstruction quality [10]. Third, compared to the large diversity of objects in real scenes, reconstructed virtual environments are still far from satisfactory (missing small objects, incorrect labels). Existing methods have explored various kinds of contextual knowledge, including object support relationships [7,8] and human activity [7], to improve modeling quality. However, their relational (or contextual) features are hand-crafted and fail to cover the wide range of objects found in cluttered scenes.
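To make the shallow-to-deep idea above concrete, the following is a minimal, illustrative PyTorch sketch, not the authors' exact architecture: the class name, layer sizes, and head output dimensions are all assumptions. Shallow heads predict non-relational knowledge (room layout, object geometry), and their outputs are concatenated with image features before entering a deeper relational stage, mirroring the feed-forward of non-relational into relational parsing.

```python
import torch
import torch.nn as nn

class ShallowToDeepPipeline(nn.Module):
    """Illustrative shallow-to-deep cascade (hypothetical dimensions)."""
    def __init__(self, feat_dim=256, layout_dim=8, geom_dim=7):
        super().__init__()
        # Shallow stage: shared convolutional features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Non-relational heads (assumed output sizes).
        self.layout_head = nn.Linear(feat_dim, layout_dim)  # room layout params
        self.geom_head = nn.Linear(feat_dim, geom_dim)      # object 3D box params
        # Deep stage: consumes image features *and* the shallow predictions.
        self.relational = nn.Sequential(
            nn.Linear(feat_dim + layout_dim + geom_dim, 128), nn.ReLU(),
            nn.Linear(128, 64))

    def forward(self, image):
        feat = self.backbone(image)                  # (B, feat_dim)
        layout = self.layout_head(feat)              # non-relational: layout
        geom = self.geom_head(feat)                  # non-relational: geometry
        deep_in = torch.cat([feat, layout, geom], dim=-1)
        return layout, geom, self.relational(deep_in)
```

For example, `layout, geom, rel = ShallowToDeepPipeline()(torch.randn(1, 3, 224, 224))` yields the shallow predictions together with the relational features consumed by the deeper stage.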
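The support-relationship inference can likewise be sketched as a pairwise relation module in the spirit of Relation Networks (Santoro et al., 2017), which the paper adapts to objects in a scene. The feature dimension, hidden width, and the three relation classes used here (supported from below, supported from behind, no support) are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SupportRelationNet(nn.Module):
    """Pairwise relation module scoring support relations between objects."""
    def __init__(self, obj_dim=256, hidden=128, n_relations=3):
        super().__init__()
        self.g = nn.Sequential(                      # per-pair relation features
            nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(                      # relation classifier
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_relations))          # below / behind / none

    def forward(self, obj_feats):
        # obj_feats: (N, obj_dim), one pooled feature row per detected object.
        n = obj_feats.size(0)
        a = obj_feats.unsqueeze(1).expand(n, n, -1)  # object i (supported)
        b = obj_feats.unsqueeze(0).expand(n, n, -1)  # object j (supporter)
        pair = torch.cat([a, b], dim=-1)             # all ordered pairs (i, j)
        return self.f(self.g(pair))                  # (N, N, n_relations) logits
```

Under this reading, `logits[i, j]` scores whether object j supports object i; the inferred relations could then constrain the global optimization, e.g. by snapping supported objects onto their supporters during scene modeling.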