Emotions play an essential role in human planning and decision making. Emotion recognition is a widely explored topic in artificial intelligence and affective computing, as a means of empathizing with humans and thereby improving human-machine interaction. Although audio-visual cues are vital for recognizing human emotions, they are sometimes insufficient for identifying the emotions of people who are good at hiding them or of people with alexithymia. Considering other modalities, such as electroencephalogram (EEG) signals or text, along with audio-visual cues can improve the results in such situations. Exploiting the complementarity of multiple modalities generally captures emotions more accurately than a single modality does. However, achieving accurate results requires that these multimodal signals be fused correctly. This paper reviews multimodal fusion techniques that can be used for emotion recognition and presents an in-depth study of feature-level, decision-level, and hybrid fusion for identifying human emotions from multimodal inputs, comparing their results. The experiments concentrate on three modalities, namely facial images, audio, and text, at least one of which differs from the others in temporal characteristics. The results suggest that hybrid fusion works best for combining multiple modalities that differ in time synchronicity.
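To make the three fusion strategies named above concrete, the following is a minimal sketch in plain Python and not the implementation used in this study. It makes several assumptions: a toy linear classifier with softmax output, a hypothetical three-class label set (`EMOTIONS`), made-up feature values and weights, and, for the hybrid case, the supposition that facial and audio features are time-aligned while text is not. All function names are illustrative.

```python
import math

# Hypothetical label set, used only for illustration.
EMOTIONS = ["happy", "sad", "angry"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_classifier(features, weights):
    """Toy classifier (an assumption): one weight row per emotion."""
    for row in weights:
        if len(row) != len(features):
            raise ValueError("weight row length must match feature length")
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)


def feature_level_fusion(feature_sets, weights):
    """Early fusion: concatenate modality features, then classify once."""
    fused = [x for feats in feature_sets for x in feats]
    return linear_classifier(fused, weights)


def decision_level_fusion(prob_sets, modality_weights=None):
    """Late fusion: weighted average of per-modality class probabilities."""
    if modality_weights is None:
        modality_weights = [1.0] * len(prob_sets)
    total = sum(modality_weights)
    n_classes = len(prob_sets[0])
    return [
        sum(w * p[i] for w, p in zip(modality_weights, prob_sets)) / total
        for i in range(n_classes)
    ]


def hybrid_fusion(aligned_feature_sets, aligned_weights, other_probs):
    """Hybrid: early-fuse the time-aligned modalities, then late-fuse
    that decision with the decision of the remaining modality."""
    early = feature_level_fusion(aligned_feature_sets, aligned_weights)
    return decision_level_fusion([early, other_probs])


# Made-up inputs: two features each for face and audio (assumed aligned),
# and class probabilities from a separate, hypothetical text model.
face = [0.9, 0.1]
audio = [0.8, 0.2]
text_probs = [0.6, 0.3, 0.1]
weights = [
    [1.0, 0.0, 1.0, 0.0],  # happy
    [0.0, 1.0, 0.0, 1.0],  # sad
    [0.5, 0.5, 0.5, 0.5],  # angry
]

fused = hybrid_fusion([face, audio], weights, text_probs)
print(EMOTIONS[fused.index(max(fused))])
```

The hybrid variant reflects one plausible reading of why it suits modalities that differ in time synchronicity: features are concatenated only where frame-level alignment is available, and the remaining modality contributes at the decision level, where alignment is not needed.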