Internet of Things (IoT) technologies such as device interconnection and edge computing enable emotion recognition to be applied in healthcare, smart education, and other domains. However, during acquisition and transmission, signals may be lost or corrupted by severe motion-induced noise, which degrades the quality of the received data and limits the performance of IoT emotion detection. We refer to such records collectively as invalid data. We propose a multi-step deep (MSD) system that reliably detects emotion from collected multimodal records containing invalid data. Semantic compatibility and continuity are used to filter out the invalid data. Features from invalid modal data are then replaced by imputation to compensate for the effect of invalid data on emotion detection. In this way, the proposed system processes invalid data automatically and improves recognition performance. Furthermore, to exploit spatiotemporal information, the MSD system extracts features from video and physiological signals with dedicated deep neural networks. Simulation experiments on a public multimodal database show that the MSD system achieves a higher unweighted average recall than the traditional system. These promising results indicate the potential of the proposed system for practical IoT applications.
INDEX TERMS Internet of Things, multimodal emotion detection, invalid data, multi-step deep (MSD) system, deep neural networks
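The evaluation metric named above, unweighted average recall (UAR), is the mean of the per-class recalls, so each emotion class contributes equally regardless of how many samples it has. The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function name and the example labels are hypothetical.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (standard UAR definition).

    Classes are taken from y_true, so every class that actually occurs
    is weighted equally, independent of its sample count.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced example: a classifier that always predicts "calm".
y_true = ["calm"] * 8 + ["stressed"] * 2
y_pred = ["calm"] * 10
uar = unweighted_average_recall(y_true, y_pred)
```

In this hypothetical example the plain accuracy is 0.8, while the UAR is 0.5 because the recall for the minority class is zero. This is why UAR is commonly preferred when emotion classes are imbalanced.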