Ground-based cloud images contain a wealth of cloud information and are an important resource for meteorological research. In practice, however, ground-based cloud images must be segmented and classified to obtain the cloud amount, cloud type, and cloud coverage. Existing methods ignore the relationship between cloud segmentation and classification and typically study only one of the two tasks. Accordingly, this paper proposes a novel method for the joint classification and segmentation of cloud images, called CloudY-Net. Whereas the basic Y-Net framework extracts feature maps only from the central layer, we extract feature maps from four different layers to capture more useful information and improve classification accuracy. These feature maps are fused into a single feature vector used to train the classifier. In addition, a multi-head self-attention mechanism is applied during fusion to further enhance the information interaction among the features. We also propose a new module, the Cloud Mixture-of-Experts (C-MoE), which enables the model to learn the weight of each feature layer automatically, thereby improving the quality of the fused feature representation. Experiments are conducted on the public multimodal ground-based cloud dataset (MGCD). The results demonstrate that the proposed model significantly improves classification accuracy over classical networks and state-of-the-art algorithms, reaching 88.58%. In addition, we annotate 4000 images in the MGCD for cloud segmentation, producing a cloud segmentation dataset called MGCD-Seg, on which our method achieves 96.55% mIoU, validating the efficacy of our method for both segmentation and classification of ground-based cloud imagery.
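As a rough illustration of the fusion scheme described above, the sketch below shows one plausible way to fuse feature maps from four encoder depths with multi-head self-attention and a learned per-layer weighting in the spirit of a Mixture-of-Experts gate. The module names, dimensions, class count, and gating design are assumptions for illustration only, not the authors' released CloudY-Net/C-MoE implementation.

```python
import torch
import torch.nn as nn

class FusionClassifierSketch(nn.Module):
    """Illustrative only: fuse four multi-depth feature maps into one
    classification vector via self-attention and a learned layer gate."""

    def __init__(self, in_channels=(64, 128, 256, 512), dim=256,
                 num_heads=4, num_classes=7):
        super().__init__()
        # Project each encoder stage to a common embedding dimension.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, dim, kernel_size=1) for c in in_channels])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Multi-head self-attention over the four layer tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # MoE-style gate: one softmax weight per feature layer (an assumption
        # about how C-MoE weights layers, not the paper's exact design).
        self.gate = nn.Linear(dim * len(in_channels), len(in_channels))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats):  # feats: list of 4 maps, each [B, C_i, H_i, W_i]
        # One pooled token per layer -> sequence of shape [B, 4, dim].
        tokens = torch.stack(
            [self.pool(p(f)).flatten(1) for p, f in zip(self.proj, feats)],
            dim=1)
        # Let the layer tokens interact before weighting.
        tokens, _ = self.attn(tokens, tokens, tokens)
        # Learned per-layer weights, then a weighted sum as the fused vector.
        weights = torch.softmax(self.gate(tokens.flatten(1)), dim=-1)  # [B, 4]
        fused = (weights.unsqueeze(-1) * tokens).sum(dim=1)            # [B, dim]
        return self.classifier(fused)

# Usage with dummy feature maps of decreasing spatial size:
model = FusionClassifierSketch()
feats = [torch.randn(2, c, s, s)
         for c, s in zip((64, 128, 256, 512), (56, 28, 14, 7))]
logits = model(feats)  # [2, 7]
```

Pooling each projected map to a single token keeps the attention step cheap (a sequence of only four tokens), while the softmax gate plays the role of the automatic layer weighting attributed to C-MoE in the text.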