Recently, multimodal information has been taken into consideration for ground-based cloud classification in weather station networks, but existing methods cannot sufficiently mine the intrinsic correlations between the multimodal information and the visual information. We propose a novel approach called hierarchical multimodal fusion (HMF) for ground-based cloud classification in weather station networks, which fuses the deep multimodal features and the deep visual features at different levels, i.e., low-level fusion and high-level fusion. The low-level fusion directly fuses the heterogeneous features, focusing on modality-specific fusion. The high-level fusion integrates the output of the low-level fusion with the deep visual features and the deep multimodal features, and can learn complex correlations among them owing to its deep fusion structure. We employ a single loss function to train the overall HMF framework so as to improve the discrimination of the cloud representations. Experimental results on the MGCD dataset indicate that our method outperforms other methods, which verifies the effectiveness of the HMF for ground-based cloud classification.
INDEX TERMS Weather station networks, ground-based cloud classification, hierarchical multimodal fusion, convolutional neural network.
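The two-level fusion described in the abstract can be illustrated with a minimal sketch. All names, layer sizes, and the choice of simple fully connected fusion layers here (HMF, visual_dim, multi_dim, fused_dim) are illustrative assumptions, not the paper's exact architecture; the sketch only mirrors the stated structure: a low-level fusion of the heterogeneous features, a high-level fusion over its output together with both feature sets, and one loss for the whole framework.

```python
# Minimal sketch of the hierarchical multimodal fusion (HMF) idea, under the
# assumptions stated above. Dimensions and layer choices are hypothetical.
import torch
import torch.nn as nn

class HMF(nn.Module):
    def __init__(self, visual_dim=512, multi_dim=64, fused_dim=256, num_classes=7):
        super().__init__()
        # Low-level fusion: directly fuses the heterogeneous deep visual and
        # deep multimodal features (modality-specific fusion).
        self.low_fuse = nn.Sequential(
            nn.Linear(visual_dim + multi_dim, fused_dim), nn.ReLU())
        # High-level fusion: integrates the low-level fusion output with the
        # deep visual and deep multimodal features, so that more complex
        # correlations among them can be learned.
        self.high_fuse = nn.Sequential(
            nn.Linear(fused_dim + visual_dim + multi_dim, fused_dim), nn.ReLU())
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, visual_feat, multi_feat):
        low = self.low_fuse(torch.cat([visual_feat, multi_feat], dim=1))
        high = self.high_fuse(torch.cat([low, visual_feat, multi_feat], dim=1))
        return self.classifier(high)

# A single loss function trains the overall framework end to end.
model = HMF()
criterion = nn.CrossEntropyLoss()
visual_feat = torch.randn(8, 512)   # e.g., CNN features of a cloud image
multi_feat = torch.randn(8, 64)     # e.g., embedded weather measurements
labels = torch.randint(0, 7, (8,))
loss = criterion(model(visual_feat, multi_feat), labels)
loss.backward()
```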