Multi-label classification (MLC) of aerial images is a crucial task in remote sensing image analysis. Traditional image classification methods are limited in their ability to extract image features, which has led to the increasing use of deep learning models such as convolutional neural networks (CNNs) and vision transformers (ViTs). However, each of these models may still be limited when used alone for MLC. To enhance the generalization performance of MLC of aerial images, this paper combines two CNN and two ViT models and compares three approaches: the four single deep learning models, a manually weighted ensemble learning method, and a genetic algorithm (GA)-based weighted ensemble method. Experimental results on two public multi-label aerial image datasets show that the ViT models classify better than the CNN models, that the traditional, manually weighted ensemble performs better than a single deep learning model, and that the GA-based weighted ensemble performs better than the manually weighted ensemble. The GA-based weighted ensemble method proposed in this study achieves better MLC performance on aerial images than previously reported results.
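To make the idea of a GA-based weighted ensemble concrete, the following is a minimal sketch, not the implementation used in this study. It assumes that each of the four models outputs per-label probabilities on a validation set, that the ensemble is a weighted average of those probabilities, and that the GA fitness is the micro-averaged F1 score at a threshold of 0.5. The function names (`micro_f1`, `normalize`, `ensemble`, `ga_weights`), the GA operators (size-2 tournament selection, blend crossover, Gaussian mutation, elitism), the hyperparameter values, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def micro_f1(y_true, y_prob, threshold=0.5):
    """Micro-averaged F1 over all (sample, label) pairs (assumed fitness)."""
    y_pred = y_prob >= threshold
    y_true = y_true.astype(bool)
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def normalize(w):
    """Clip to non-negative values and rescale so the weights sum to 1."""
    w = np.clip(w, 0.0, None)
    s = w.sum()
    return w / s if s > 0 else np.full_like(w, 1.0 / len(w))


def ensemble(probs, w):
    """Weighted average of model outputs; probs has shape (models, samples, labels)."""
    return np.tensordot(w, probs, axes=1)


def ga_weights(probs, y_true, pop_size=30, generations=40, mut_sigma=0.1, seed=0):
    """Search for ensemble weights with a simple GA (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    n_models = probs.shape[0]
    pop = np.array([normalize(rng.random(n_models)) for _ in range(pop_size)])
    # Seed one individual with equal weights so the result never falls below that baseline.
    pop[0] = np.full(n_models, 1.0 / n_models)

    def fitness(w):
        return micro_f1(y_true, ensemble(probs, w))

    for _ in range(generations):
        scores = np.array([fitness(w) for w in pop])
        # Elitism: carry the best individual over unchanged.
        children = [pop[np.argmax(scores)].copy()]
        while len(children) < pop_size:
            # Size-2 tournament selection for each parent.
            i, j = rng.integers(pop_size, size=2)
            p1 = pop[i] if scores[i] >= scores[j] else pop[j]
            i, j = rng.integers(pop_size, size=2)
            p2 = pop[i] if scores[i] >= scores[j] else pop[j]
            # Blend crossover followed by Gaussian mutation.
            alpha = rng.random()
            child = alpha * p1 + (1 - alpha) * p2
            child = child + rng.normal(0.0, mut_sigma, n_models)
            children.append(normalize(child))
        pop = np.array(children)

    scores = np.array([fitness(w) for w in pop])
    best = int(np.argmax(scores))
    return pop[best], scores[best]


if __name__ == "__main__":
    # Synthetic stand-in for validation labels and four models' probabilities.
    rng = np.random.default_rng(42)
    y_val = rng.integers(0, 2, size=(200, 6))
    noise_levels = [0.6, 0.5, 0.35, 0.3]  # lower noise = stronger model
    probs_val = np.stack(
        [np.clip(y_val + rng.normal(0, s, y_val.shape), 0, 1) for s in noise_levels]
    )
    w, f1 = ga_weights(probs_val, y_val)
    print("weights:", np.round(w, 3), "validation micro-F1:", round(float(f1), 4))
```

In practice the weights would be fitted on a held-out validation split and then applied unchanged to the test set, since weights tuned and scored on the same data give an optimistic estimate. A manually weighted ensemble corresponds to calling `ensemble` with a hand-chosen weight vector instead of the GA result.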