Interaction between humans is by nature a multimodal process, integrating the different modalities of vision, hearing, gesture and touch. It is not surprising, then, that conventional unimodal human-machine interactions lag behind human-human interactions in performance, robustness and naturalness. Recently, there has been increasing research interest in jointly processing information in multiple modalities and mimicking human-human multimodal interactions [2,4,5,9,13,14,16,18,19,21,22]. For example, human speech production and perception are bimodal in nature: visual cues have a broad influence on perceived auditory stimuli [17]. The latest research in speech processing has shown that integrating auditory and visual information, i.e., acoustic speech combined with facial and lip motions, yields significant performance improvements in tasks such as speech recognition [3] and emotion recognition [16]. However, information from different modalities can be difficult to integrate: the representations of the signals are heterogeneous, and the signals are often asynchronous and loosely coupled. Finding more effective ways to integrate and jointly process information from different modalities is therefore essential to the success of multimodal human-machine interactive systems. Meanwhile, multimodal applications are becoming increasingly important, especially in mobile computing and digital entertainment scenarios such as natural user interfaces (NUI) on smartphones and motion-sensing gaming.

This special issue aims to bring together work by researchers and technologists engaged in the development of multimodal technologies for information processing, emerging multimedia applications and user-centric human-computer interaction. We received more than