This paper addresses the problem of fusing smile, as a visual feature, with text, as a transcription result. The influence of smile on semantic data has been considered before, but without investigating multiple fusion approaches. The problem is multi-modal, which makes it more difficult. The goal of this article is to investigate how this fusion could increase the interactivity of a dialogue system by improving the automatic detection rate of the sentiments expressed by a human user. Our approach makes two original propositions. The first lies in the use of segmented detection for text data, rather than predicting a single label for each document (video). The second is a study of the importance of several features in the multi-modal fusion process. Our approach uses basic features, such as NGrams, Smile Presence, and Valence, to find the best fusion strategy. Moreover, we test a two-level classification approach using an SVM.
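To make the fusion idea concrete, the sketch below shows one plausible reading of feature-level fusion, not the authors' implementation: n-gram text features are concatenated with assumed per-segment visual cues (smile presence, valence) before training a single SVM. The toy data and feature names are illustrative assumptions.

```python
# Minimal sketch of early (feature-level) fusion with an SVM; illustrative only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy data: transcribed segments, assumed per-segment visual cues, sentiment labels.
transcripts = ["that is wonderful", "i am not sure", "this is terrible", "great idea"]
smile_presence = np.array([[1.0], [0.0], [0.0], [1.0]])  # assumed binary smile cue
valence = np.array([[0.8], [0.1], [-0.7], [0.6]])        # assumed scalar valence cue
labels = np.array([1, 0, 0, 1])                          # 1 = positive sentiment

# Text modality: unigram and bigram counts.
vectorizer = CountVectorizer(ngram_range=(1, 2))
text_features = vectorizer.fit_transform(transcripts).toarray()

# Early fusion: concatenate text and visual features into one vector per segment.
fused = np.hstack([text_features, smile_presence, valence])

# Single SVM trained over the fused representation.
clf = SVC(kernel="linear")
clf.fit(fused, labels)
print(clf.predict(fused))
```

A two-level variant, as mentioned above, would instead train one classifier per modality and feed their outputs to a second-stage SVM; the sketch here only illustrates the simpler concatenation baseline.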