This book is an introduction to multimodal signal processing. In it, we use the goal of building applications that can understand meetings as a way to focus and motivate the processing we describe. Multimodal signal processing takes the outputs of capture devices running at the same time (primarily cameras and microphones, but also electronic whiteboards and pens) and automatically analyses them to make sense of what is happening in the space being recorded. For instance, these analyses might indicate who spoke, what was said, whether there was an active discussion, and who was dominant in it. These analyses require the capture of multimodal data using a range of signals, followed by low-level automatic annotation of that data, with further annotation gradually layered on top until information that relates to user requirements is extracted.

Multimodal signal processing can be done in real time, that is, fast enough to build applications that influence the group while they are together, or offline (often, though not always, at higher quality) for later review of what went on. It can also be done for groups that are all together in one space, typically an instrumented meeting room, or for groups that are in different spaces but use technology such as video-conferencing to communicate. The book thus introduces automatic approaches to capturing, processing and ultimately understanding human interaction in meetings, and describes the state of the art for all technologies involved.

Multimodal signal processing raises the possibility of a wide range of applications that help groups improve their interactions and hence their effectiveness between or during meetings. However, developing applications has required improvements in the technological state of the art in many arenas.

The first comprises core technologies like audio and visual processing and recognition that tell us basic facts such as who was present and what words were said.
On top of this information comes processing that begins to make sense of a meeting in human terms. Part of this is simply combining different sources of information into a record of who said what, when, and to whom, but it is often also useful, for instance, to apply models of group dynamics from the behavioral and social sciences in order to reveal how a group interacts, or to abstract and summarize the meeting content overall. Finding ways to integrate the varying analyses required for a particular meeting support application has been a major new challenge.

Finally, moving from components that model and analyze multimodal human-to-human communication scenes to real-world applications has required careful user requirements capture,