We present InstanceFusion, a robust real-time system to detect, segment, and reconstruct instance-level 3D objects of indoor scenes with a hand-held RGBD camera. It combines the strengths of deep learning and traditional SLAM techniques to produce visually compelling 3D semantic models. The key success comes from our novel segmentation scheme and the efficient instancelevel data fusion, which are both implemented on GPU. Specifically, for each incoming RGBD frame, we take the advantages of the RGBD features, the 3D point cloud, and the reconstructed model to perform instance-level segmentation. The corresponding RGBD data along with the instance ID are then fused to the surfel-based models. In order to sufficiently store and update these data, we design and implement a new data structure using the OpenGL Shading Language. Experimental results show that our method advances the state-of-the-art (SOTA) methods in instance segmentation and data fusion by a big margin. In addition, our instance segmentation improves the precision of 3D reconstruction, especially in the loop closure. InstanceFusion system runs 20.5Hz on a consumer-level GPU, which supports a number of augmented reality (AR) applications (e.g., 3D model registration, virtual interaction, AR map) and robot applications (e.g., navigation, manipulation, grasping). To facilitate future research and reproduce our system more easily, the source code, data, and the trained model are released on Github: https://github.com/Fancomi2017/InstanceFusion. CCS Concepts • Computing methodologies → Scene understanding; Vision for robotics; Perception;