Smart devices such as smartphones and smartwatches are promising platforms for the automatic recognition of human activities. However, it is difficult to accurately monitor complex human activities because of inter-class pattern similarity, which occurs when different human activities exhibit similar signal patterns or characteristics. Current smartphone-based recognition systems depend on traditional sensors, such as the accelerometer and gyroscope, that are built into these devices. Consequently, beyond the information these traditional sensors provide, such systems lack the contextual information needed to support automatic activity recognition. In this article, we explore environment contexts such as illumination (light conditions) and noise level to complement the sensory data obtained from traditional sensors, using a hybrid of Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) learning models. The models perform sensor fusion by augmenting the low-level sensor signals with rich contextual data to improve the recognition and generalisation ability of the proposed solution. Two sets of experiments were performed to validate the proposed solution. The first set used inertial sensing data only, whilst the second, more extensive set combined the inertial signals with contextual information from environment sensing data. The results demonstrate that hybrid deep learning models enriched with contextual information, such as environment noise level and illumination, achieve better recognition accuracy than traditional activity recognition models without contextual information.
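To make the fusion idea concrete, the sketch below shows one plausible way to combine an inertial CNN-LSTM branch with an environment-context branch in Keras. It is a minimal illustration, not the configuration reported in this article: the window length, channel counts, layer sizes, and number of activity classes are all assumed values.

```python
# A minimal sketch of a CNN-LSTM sensor-fusion model of the kind described
# above. All shapes and hyperparameters are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

WINDOW = 128      # assumed: timesteps per inertial window
INERTIAL_CH = 6   # assumed: accelerometer (x, y, z) + gyroscope (x, y, z)
CONTEXT_DIM = 2   # assumed: illumination level, ambient noise level
NUM_CLASSES = 6   # assumed: number of activity classes

# Branch 1: convolutional feature extraction over raw inertial windows,
# followed by an LSTM that models temporal dependencies.
inertial_in = layers.Input(shape=(WINDOW, INERTIAL_CH), name="inertial")
x = layers.Conv1D(64, kernel_size=5, activation="relu")(inertial_in)
x = layers.MaxPooling1D(pool_size=2)(x)
x = layers.Conv1D(64, kernel_size=5, activation="relu")(x)
x = layers.LSTM(64)(x)

# Branch 2: a small dense encoder for the environment-context features.
context_in = layers.Input(shape=(CONTEXT_DIM,), name="context")
c = layers.Dense(16, activation="relu")(context_in)

# Fusion: concatenate the learned inertial features with the context
# embedding, then classify the activity.
fused = layers.Concatenate()([x, c])
fused = layers.Dense(64, activation="relu")(fused)
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = Model(inputs=[inertial_in, context_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

One design note on this sketch: fusing the context features after the LSTM, rather than appending them to every raw timestep, keeps the slowly varying environment signals from being smoothed away by the convolutional filters; fusing at the input instead would be an equally defensible choice.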