Cyberbullying involves the use of social media platforms to harm or humiliate people online. Victims may resort to self-harm as a result of the abuse they experience on these platforms, where users can remain anonymous while spreading malicious content. This highlights an urgent need for effective systems to identify and classify cyberbullying. Researchers have approached the problem with a variety of methods, including binary and multi-class classification on text, image, or multi-modal data. While deep learning has advanced cyberbullying detection and classification, multi-class classification of cyberbullying from multi-modal data, such as memes, remains underexplored. This paper addresses that gap by proposing several multi-modal hybrid deep learning models, namely LSTM+ResNet, LSTM+CNN, LSTM+ViT, GRU+ResNet, GRU+CNN, GRU+ViT, BERT+ResNet, BERT+CNN, BERT+ViT, DistilBERT+ResNet, DistilBERT+CNN, DistilBERT+ViT, RoBERTa+ResNet, RoBERTa+CNN, and RoBERTa+ViT, for multi-class cyberbullying classification. Each proposed model applies a late fusion process, combining a text feature extractor (LSTM, GRU, BERT, DistilBERT, or RoBERTa) with an image feature extractor (ResNet, CNN, or ViT). The models are trained on two datasets: a private dataset collected from various social media platforms and a public dataset obtained from previously published research. Experimental results show that the RoBERTa+ViT model outperforms the other hybrid models, achieving an accuracy of 99.20% and an F1-score of 0.992 on the public dataset, and an accuracy of 96.10% and an F1-score of 0.959 on the private dataset.
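The late fusion process mentioned above can be sketched as follows: features from a text encoder and an image encoder are concatenated into a single vector, which a classification head maps to class probabilities. This is a minimal NumPy illustration, not the paper's implementation; the feature dimensions, number of classes, and randomly initialised head are all assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper).
TEXT_DIM, IMAGE_DIM, NUM_CLASSES = 768, 512, 5

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def late_fusion_classify(text_feat, image_feat, W, b):
    """Concatenate per-modality features, then classify the fused vector."""
    fused = np.concatenate([text_feat, image_feat])  # shape: (TEXT_DIM + IMAGE_DIM,)
    return softmax(W @ fused + b)                    # class probability vector

# Stand-ins for encoder outputs (e.g. a RoBERTa pooled text vector
# and a ViT pooled image vector in the actual models).
text_feat = rng.standard_normal(TEXT_DIM)
image_feat = rng.standard_normal(IMAGE_DIM)

# Randomly initialised classification head (would be learned in practice).
W = rng.standard_normal((NUM_CLASSES, TEXT_DIM + IMAGE_DIM)) * 0.01
b = np.zeros(NUM_CLASSES)

probs = late_fusion_classify(text_feat, image_feat, W, b)
print(probs.shape, float(probs.sum()))
```

In the paper's setting, the two feature vectors would come from the trained text branch and image branch of a hybrid model, and the fused representation would feed a learned classification layer rather than random weights.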