The advent of Microsoft Kinect sensors opened a new research direction in human action recognition (HAR) from videos. However, depth maps and body postures are noisy and less reliable in their original form, and therefore contribute little to action recognition, especially in real-time scenarios. Moreover, HAR from a single data modality has several limitations. Hence, this paper proposes a multi-modal HAR framework based on a simple and effective deep learning model. The proposed model takes depth maps and body postures as input and describes each action through two newly proposed descriptors, namely the improved motion history image (IMHI) and the spatio-temporal posture descriptor (STPD). The IMHI removes undefined motion regions, while the STPD provides complete spatio-temporal motion information to the training system. In addition, a new temporal segmentation scheme is proposed to ensure robustness against speed variations. Finally, different fusion rules are applied to determine the correct action under different policies. We conduct extensive simulations on three standard benchmark datasets, namely MSRAction3D, MAD and PKU-MMD, and obtain average accuracies of 95.2333%, 91.9945% and 93.3141%, respectively. Experimental results show that the proposed framework is discriminative for actions with similar movements.
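For context, the motion history image on which the proposed IMHI builds follows the classical MHI recursion of Bobick and Davis: pixels where motion is detected are stamped with the current timestamp value, and all other pixels decay toward zero. The sketch below shows only this generic update in NumPy, not the paper's IMHI variant; the function name and toy parameters are illustrative assumptions.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=255, delta=1):
    """One step of the classical motion history image update.

    Pixels flagged in `motion_mask` are set to the timestamp value
    `tau`; all remaining pixels decay by `delta` (clamped at zero),
    so more recent motion appears brighter.
    """
    decayed = np.maximum(mhi.astype(np.int32) - delta, 0)
    return np.where(motion_mask, tau, decayed).astype(np.uint8)

# Toy example: motion sweeps left to right across a 1x4 frame,
# so the rightmost (most recent) pixel ends up brightest.
mhi = np.zeros((1, 4), dtype=np.uint8)
for col in range(4):
    mask = np.zeros((1, 4), dtype=bool)
    mask[0, col] = True
    mhi = update_mhi(mhi, mask, tau=4, delta=1)
print(mhi)  # recent motion columns carry higher values
```

The decaying intensity encodes when each region last moved, which is the temporal cue the paper's descriptor refines by discarding undefined motion regions.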