Multimodal sensors (visual, non-visual, and wearable) can provide complementary information for building robust perception systems that recognize activities accurately. However, extracting robust multimodal representations is challenging due to the heterogeneous characteristics of multimodal sensor data and the diversity of human activities, especially in the presence of noisy and misaligned sensor data. In this work, we propose MuMu, a cooperative multitask learning-based guided multimodal fusion approach for extracting robust multimodal representations for human activity recognition (HAR). MuMu employs an auxiliary task learning approach to extract features specific to each set of activities with shared characteristics (an activity-group). MuMu then utilizes these activity-group-specific features to direct our proposed Guided Multimodal Fusion Approach (GM-Fusion), designed as the target task, in extracting complementary multimodal representations. We evaluated MuMu by comparing its performance to state-of-the-art multimodal HAR approaches on three activity datasets. Our extensive experimental results suggest that MuMu outperforms all the evaluated approaches across all three datasets. Additionally, our ablation study suggests that MuMu significantly outperforms the baseline models (p < 0.05).
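
To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the cooperative multitask idea: an auxiliary head predicts the activity-group, and its shared features guide (here, via a simple gating mechanism) the fusion of modality-specific representations used by the target activity classifier. All module names, dimensions, and the specific gating scheme are illustrative assumptions, not the authors' implementation of MuMu or GM-Fusion.

```python
# Minimal sketch of auxiliary-task-guided multimodal fusion (assumptions throughout).
import torch
import torch.nn as nn


class GuidedMultimodalFusionSketch(nn.Module):
    def __init__(self, modality_dims, hidden_dim, num_groups, num_activities):
        super().__init__()
        # One encoder per sensor modality (e.g., visual, inertial, wearable).
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden_dim), nn.ReLU()) for d in modality_dims]
        )
        # Auxiliary task: predict the activity-group from concatenated features.
        self.group_head = nn.Linear(hidden_dim * len(modality_dims), num_groups)
        # Activity-group-specific features produce per-modality gating weights.
        self.gate = nn.Sequential(
            nn.Linear(hidden_dim * len(modality_dims), len(modality_dims)),
            nn.Softmax(dim=-1),
        )
        # Target task: activity classification from the guided (gated) fusion.
        self.activity_head = nn.Linear(hidden_dim, num_activities)

    def forward(self, inputs):
        # inputs: list of tensors, one per modality, each of shape (batch, dim_m).
        feats = [enc(x) for enc, x in zip(self.encoders, inputs)]  # (B, H) each
        concat = torch.cat(feats, dim=-1)                          # (B, H * M)
        group_logits = self.group_head(concat)                     # auxiliary task
        weights = self.gate(concat)                                # (B, M)
        stacked = torch.stack(feats, dim=1)                        # (B, M, H)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)       # guided fusion
        activity_logits = self.activity_head(fused)                # target task
        return activity_logits, group_logits


if __name__ == "__main__":
    model = GuidedMultimodalFusionSketch(
        modality_dims=[128, 64], hidden_dim=32, num_groups=4, num_activities=12
    )
    x = [torch.randn(8, 128), torch.randn(8, 64)]
    activity_logits, group_logits = model(x)
    # Cooperative multitask objective: weighted sum of target and auxiliary losses
    # (the 0.5 auxiliary weight is an arbitrary placeholder).
    y_act = torch.randint(0, 12, (8,))
    y_grp = torch.randint(0, 4, (8,))
    loss = nn.functional.cross_entropy(activity_logits, y_act) \
        + 0.5 * nn.functional.cross_entropy(group_logits, y_grp)
    loss.backward()
```

The design choice illustrated here is that the auxiliary (activity-group) branch does not merely regularize training; its intermediate representation actively conditions how the modality-specific features are combined for the target task.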