Purpose: This literature review aims to examine Multimodal Emotion Recognition (MER) research in depth and breadth by analysing the topics, trends, modalities, and other supporting sources discussed in studies published between 2010 and 2022. After screening, a total of 14,533 articles were analysed to achieve this goal.

Methods: The research was conducted in three phases: planning, conducting, and reporting. The first phase defined the research objectives by searching for systematic reviews on topics similar to this study, then reviewing them to develop the research questions and the systematic review protocol. The second phase collected articles according to the pre-determined protocol, selected the articles obtained, and then analysed the filtered articles to answer the research questions. The final phase summarised the results of the analysis so that the new findings of this research could be reported.

Result: In general, the focus of MER research can be categorised into two issues: the background of the object and the source, or modality, of emotion recognition. With respect to object background, 55% of the studies address emotion recognition in a health context, especially declining brain function; 34% focus on age, 10% on gender, 1% on the data-collection situation, and less than 1% on ethnic culture. In terms of the source of emotion recognition, research is divided into electromagnetic signals, voice signals, text, photo/video, and the development of wearable devices. Based on these results, at least seven scientific fields discuss MER research: health, psychology, electronics, grammar, communication, socio-culture, and computer science.

Novelty: MER research has the potential to develop further. Many areas have received little attention, even though the ecosystems that use them have grown massively. Emotion recognition modalities are numerous and diverse, but research still focuses on validating the emotions recognised by each modality individually, rather than exploiting the strengths of each modality to improve the quality of recognition results.