In smart homes, data generated by real-time sensors for human activity recognition is complex, noisy, and imbalanced. Building machine learning models that can classify less frequently occurring activities is a significant challenge, because models trained on imbalanced data are naturally biased towards the more common classes, which contain more records to learn from. This paper examines whether fusing real-world imbalanced multi-modal sensor data improves classification results compared with using unimodal data, and compares deep learning approaches to handling imbalanced multi-modal sensor data across various resampling methods and model architectures. Experiments were carried out using a large multi-modal sensor dataset generated by the Sensor Platform for HEalthcare in a Residential Environment (SPHERE). The data comprises 16,104 samples, each with 5,608 features and belonging to one of 20 activities (classes). The experimental results demonstrate the challenges of dealing with imbalanced multi-modal data and highlight the importance of having a sufficient number of samples in each class for adequately training and testing deep learning models. Furthermore, the results revealed that when the data was fused and the Synthetic Minority Oversampling Technique (SMOTE) was used to correct class imbalance, CNN-LSTM achieved the highest classification accuracy of 93.67%, followed by CNN (93.55%) and LSTM (92.98%).
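To make the described pipeline concrete, the following is a minimal sketch, not the authors' implementation, of the fuse-resample-classify procedure: fused feature vectors are oversampled with SMOTE and fed to a CNN-LSTM classifier. The data shapes match those reported for SPHERE (5,608 features, 20 classes), but the placeholder data, the reshaping into 8 channels, and all hyper-parameters are illustrative assumptions.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from tensorflow.keras import layers, models

NUM_FEATURES = 5608   # fused feature vector length reported for SPHERE
NUM_CLASSES = 20      # number of activity classes

# X: (n_samples, NUM_FEATURES) fused sensor features; y: integer class labels.
# Random placeholder data stands in for the real SPHERE samples.
X = np.random.rand(1000, NUM_FEATURES).astype("float32")
y = np.random.randint(0, NUM_CLASSES, size=1000)

# Oversample minority classes so every class matches the majority class size.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Reshape flat vectors into (timesteps, channels) for Conv1D/LSTM layers;
# splitting into 8 channels is an assumption made purely for illustration.
X_res = X_res.reshape(len(X_res), NUM_FEATURES // 8, 8)

model = models.Sequential([
    layers.Input(shape=(NUM_FEATURES // 8, 8)),
    layers.Conv1D(64, kernel_size=5, activation="relu"),  # convolutional feature extraction
    layers.MaxPooling1D(2),
    layers.LSTM(64),                                       # temporal modelling of the feature sequence
    layers.Dense(NUM_CLASSES, activation="softmax"),       # one output per activity class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_res, y_res, epochs=5, batch_size=64)
```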