To perform complex tasks in realistic human environments, robots need to learn new concepts in the wild, incrementally, and through their interactions with humans. This paper presents an end-to-end pipeline for learning object models incrementally during human-robot interaction. The proposed pipeline consists of three parts: (a) recognizing the interaction type, (b) detecting the object that the interaction targets, and (c) incrementally learning object models from data recorded by the robot's sensors. Our main contributions lie in the target object detection, which is guided by the recognized interaction, and in the incremental object learning. The novelty of our approach is its focus on natural, heterogeneous, and multimodal human-robot interactions as the basis for incrementally learning new object models. Throughout the paper we highlight the main challenges of this problem, such as a high degree of occlusion and clutter, domain change, low-resolution data, and interaction ambiguity. Our work shows the benefits of multi-view approaches and of combining visual and language features, and our experimental results outperform standard baselines.

Note to Practitioners: This work was motivated by the challenges of recognition tasks in dynamic and varying scenarios. Our approach learns to recognize new user interactions and objects. To do so, it uses multimodal data from the user-robot interaction: visual data to learn the objects, and speech to learn the label and to help recognize the interaction type. We use state-of-the-art deep learning models to segment the user and the objects in the scene. Our incremental learning algorithm is based on a classic incremental clustering approach (a minimal sketch follows at the end of this note). The proposed pipeline works with all sensors mounted on the robot, so the system remains mobile. Our experiments use data recorded from a Baxter robot, which enables the use of its manipulator arms in future work, but the pipeline would work with any robot on which the same sensors can be mounted: two RGB-D cameras and a microphone. The pipeline currently has high computational requirements to run the two deep-learning-based steps; we have tested it on a desktop computer with a GTX 1060 GPU and 32 GB of RAM.
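The text identifies the incremental learner only as "a classic incremental clustering approach." As a non-authoritative illustration of one such classic scheme, the sketch below uses a nearest-centroid rule with a distance threshold for spawning new clusters; the class name, threshold value, and label handling are our own assumptions, not details from the paper.

```python
import numpy as np

class IncrementalClusterer:
    """Minimal nearest-centroid incremental clustering sketch (hypothetical).

    Each cluster keeps a running mean of the feature vectors assigned to it.
    A new observation either updates the closest cluster or, if it lies too
    far from every existing centroid, starts a new cluster (new object model).
    """

    def __init__(self, distance_threshold=0.5):
        self.distance_threshold = distance_threshold  # assumed value, task-dependent
        self.centroids = []  # running mean feature per object model
        self.counts = []     # number of observations absorbed per cluster
        self.labels = []     # spoken label attached to each cluster, if any

    def update(self, feature, label=None):
        """Assign one feature vector online; return the index of its cluster."""
        feature = np.array(feature, dtype=float)  # copy so the centroid owns its data
        if self.centroids:
            dists = [np.linalg.norm(feature - c) for c in self.centroids]
            idx = int(np.argmin(dists))
            if dists[idx] <= self.distance_threshold:
                # Incrementally update the running mean of the nearest cluster.
                self.counts[idx] += 1
                self.centroids[idx] += (feature - self.centroids[idx]) / self.counts[idx]
                if label is not None:
                    self.labels[idx] = label  # speech supplies/overwrites the label
                return idx
        # Too far from every known object model: spawn a new cluster.
        self.centroids.append(feature)
        self.counts.append(1)
        self.labels.append(label)
        return len(self.centroids) - 1
```

A running mean keeps memory constant per object and requires no retraining pass, which is what makes this family of methods attractive for learning during live interaction.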
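To make the three-stage structure concrete, the skeleton below wires stages (a) through (c) together, reusing the `IncrementalClusterer` from the previous sketch. Every function here is a hypothetical placeholder for the paper's deep segmentation and speech components; only the control flow reflects the description above.

```python
import numpy as np

def recognize_interaction(rgbd_frames, audio):
    """Stage (a): classify the interaction type (e.g. pointing, showing)."""
    return "pointing"  # placeholder for the deep interaction classifier

def detect_target_object(rgbd_frames, interaction_type):
    """Stage (b): segment the scene and pick the object the recognized
    interaction targets; reduced here to a dummy visual feature."""
    return np.random.rand(128)  # placeholder object feature

def transcribe_label(audio):
    """Extract the spoken object label from the microphone stream."""
    return "mug"  # placeholder speech-to-text output

def process_interaction(rgbd_frames, audio, clusterer):
    """Run one recorded interaction through stages (a)-(c)."""
    interaction = recognize_interaction(rgbd_frames, audio)
    feature = detect_target_object(rgbd_frames, interaction)
    label = transcribe_label(audio)
    return clusterer.update(feature, label)  # stage (c): incremental learning
```

The point of the skeleton is the data flow: the recognized interaction conditions the target detection, and vision and speech jointly feed the incremental learner.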