Chapter 11 Multimodal Gesture Recognition

Figure 11.1 Screenshot from the "Put That There!" demonstration video by the Architecture Machine Group at MIT [Bolt 1980].

ers [Kinect 2016]. Such advancements have led to intensified efforts to integrate multimodal gesture interfaces into real-life applications. Indeed, the field of multimodal gesture recognition has been attracting increasing interest, driven by novel HCI paradigms on a continuously expanding range of devices equipped with multimodal sensors and ever-increasing computational power, for example smartphones and smart television sets.

Nevertheless, the capabilities of modern multimodal gesture systems remain limited. In particular, the set of gestures accounted for in typical setups is mostly constrained to pointing gestures, a number of emblematic ones such as an open palm, and gestures corresponding to some sort of interaction with a physical object, e.g., pinching for zooming. At the same time, fusion with speech remains in most cases merely an experimental feature. Compared to the abundance and variety of gestures and their interplay with speech in natural human communication, there is clearly still a long way to go for the corresponding HCI research and development [Kopp 2013].

Multimodal gesture recognition constitutes a wide multidisciplinary field. This chapter makes an effort to provide a comprehensive overview of it, in both theoretical and application terms. More specifically, basic concepts related to gesturing, the multifaceted interplay of gestures and speech, and the importance of gestures in HCI are discussed in Section 11.2. An overview of current trends in the field of multimodal gesture recognition is provided in Section 11.3, focusing separately on gestures, speech, and multimodal fusion.
Further, a state-of-the-art recognition setup developed by the authors is described in detail in Section 11.4, in order to facilitate a better understanding of all the practical considerations involved in such a system. In closing, the future of multimodal gesture recognition and related challenges are discussed in Section 11.5. Finally, a set of Focus Questions to aid comprehension of the material is also provided.

11.2 Multimodal Communication and Gestures

Glossary: Terminology for Understanding Multimodal Gesture Recognition

Co-speech gestures are gestures produced while speaking. Their interplay with speech and their role in human interaction are discussed in Section 11.2.2.

Gesture recognition models are primarily statistical constructs that can be trained to represent specific gestures based on corresponding data. In this chapter, traditional hidden Markov models (HMMs) with Gaussian mixture model (GMM) observation probabilities are considered, as part of the authors' gesture recognition system detailed in Section 11.4. Further, a number of models based on deep learning approaches are overviewed in Section 11.3.1. A more elaborate discussion of deep learning for multimodal interaction modeling can be found in the second vol...
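To make the glossary's notion of HMM-based gesture models concrete, the sketch below scores a feature sequence against per-gesture HMMs with the forward algorithm and picks the highest-likelihood model. It is a minimal illustration only: emissions are single diagonal Gaussians rather than full GMMs (a GMM per state would replace the per-state density with a log-sum over mixture components), and the gesture names, features, and parameter values are invented for the example, not taken from the authors' system of Section 11.4.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at observation x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def hmm_log_likelihood(obs, log_pi, log_A, means, vars_):
    """Forward algorithm in log space: log p(obs | model).
    obs: (T, D) feature sequence; log_pi: (S,) initial state log-probs;
    log_A: (S, S) transition log-probs; means/vars_: (S, D) per-state params."""
    T, S = len(obs), len(log_pi)
    log_alpha = log_pi + np.array(
        [log_gaussian(obs[0], means[s], vars_[s]) for s in range(S)])
    for t in range(1, T):
        emit = np.array([log_gaussian(obs[t], means[s], vars_[s]) for s in range(S)])
        # For each current state, log-sum-exp over predecessor states.
        log_alpha = emit + np.array(
            [np.logaddexp.reduce(log_alpha + log_A[:, s]) for s in range(S)])
    return np.logaddexp.reduce(log_alpha)

# Two toy 2-state "gesture" models over 1-D features (illustrative parameters).
models = {
    "swipe": (np.log([0.9, 0.1]), np.log([[0.8, 0.2], [0.2, 0.8]]),
              np.array([[0.0], [3.0]]), np.array([[1.0], [1.0]])),
    "circle": (np.log([0.9, 0.1]), np.log([[0.8, 0.2], [0.2, 0.8]]),
               np.array([[5.0], [8.0]]), np.array([[1.0], [1.0]])),
}

# A short feature trajectory that moves between the "swipe" state means.
seq = np.array([[0.1], [0.3], [2.8], [3.1]])
scores = {g: hmm_log_likelihood(seq, *params) for g, params in models.items()}
best = max(scores, key=scores.get)
print(best)  # the model with the highest log-likelihood wins
```

In a real system the parameters would be estimated from labeled gesture data (e.g., via Baum-Welch training), and recognition would compare the scores of all trained gesture models on the incoming feature stream.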