Studies on human-machine interaction systems report positive results for the accuracy of the developed systems. However, problems remain, especially with input modalities such as speech, gesture, face detection, and skeleton tracking. These problems include how to design an interface that lets a machine contextualize an ongoing conversation, how to activate the system through various modalities, how to choose an appropriate multimodal fusion method, how a machine can understand human intentions, and how to develop machine knowledge. This study developed a method for building a human-machine interaction system. The method involves several stages: a multimodal activation system; recognition methods for speech, gesture, face detection, and skeleton tracking; multimodal fusion strategies; understanding of human intent and an Indonesian dialogue system; and methods for developing machine knowledge and selecting an appropriate response. The research contributes an easier and more natural human-machine interaction system based on multimodal fusion. The average accuracy of multimodal activation, the Indonesian dialogue system, gesture recognition interaction, and multimodal fusion was 87.42%, 92.11%, 93.54%, and 93%, respectively. User satisfaction with the developed multimodal recognition-based human-machine interaction system was 95%. Of the users, 76.2% found the interaction natural, and 79.4% agreed that the machine responded well to their wishes.