Smart home monitoring systems via internet of things (IoT) are required for taking care of elders at home. They provide the flexibility of monitoring elders remotely for their families and caregivers. Activities of daily living are an efficient way to effectively monitor elderly people at home and patients at caregiving facilities. The monitoring of such actions depends largely on IoT-based devices, either wireless or installed at different places. This paper proposes an effective and robust layered architecture using multisensory devices to recognize the activities of daily living from anywhere. Multimodality refers to the sensory devices of multiple types working together to achieve the objective of remote monitoring. Therefore, the proposed multimodal-based approach includes IoT devices, such as wearable inertial sensors and videos recorded during daily routines, fused together. The data from these multi-sensors have to be processed through a pre-processing layer through different stages, such as data filtration, segmentation, landmark detection, and 2D stick model. In next layer called the features processing, we have extracted, fused, and optimized different features from multimodal sensors. The final layer, called classification, has been utilized to recognize the activities of daily living via a deep learning technique known as convolutional neural network. It is observed from the proposed IoT-based multimodal layered system’s results that an acceptable mean accuracy rate of 84.14% has been achieved.