Interaction plays a critical role in skill learning for natural communication. In human-robot interaction (HRI), robots can obtain feedback during the interaction to improve their social abilities. In this context, we propose an interactive robot learning framework that uses multimodal data from thermal facial images and human gait for online emotion recognition. We also propose a new decision-level fusion method for multimodal classification using a Random Forest (RF) model. Our hybrid online emotion recognition model focuses on detecting four human emotions (i.e., neutral, happiness, anger, and sadness). After offline training and testing of the hybrid model, we find that the accuracy of the online emotion recognition system is more than 10% lower than that of the offline one. To improve the system, human verbal feedback is incorporated into the robot's interactive learning. With the new online emotion recognition system, we obtain a 12.5% increase in accuracy compared with the online system without interactive robot learning.