South Asian megacities are significant contributors to degrading air quality. In densely populated northern India, Delhi is a major hotspot for air pollutants that affect health and climate. Effective mitigation of air pollution is impeded by inadequate air quality estimation, which underscores the need for cost-effective alternatives. This paper proposes an ensemble model based on transformer and Convolutional Neural Network (CNN) models to estimate air quality from images and weather parameters in Delhi. A Data-efficient Image Transformer (DeiT) is fine-tuned on outdoor images and, in parallel, the dark channel prior extracted from the images is fed to a CNN model. Additionally, a one-dimensional CNN is trained on meteorological features to improve accuracy. The predictions from these three parallel branches are then fused through ensemble learning to classify images into six Air Quality Index (AQI) classes and to estimate the AQI value. To train and validate the proposed model, an image dataset termed 'AirSetDelhi' was collected in Delhi, India, and labeled with ground-truth AQI values. Experiments on this dataset show that the proposed model outperforms other deep learning networks from the literature. The model achieves an overall accuracy of 89.28% and a Cohen's kappa score of 0.856 for AQI classification, and an RMSE of 47.36 and an R² value of 0.861 for AQI estimation, demonstrating efficacy in both tasks. As a regional estimation model based on images and weather features, the proposed model offers a feasible alternative approach to air quality estimation.
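
For illustration, the dark channel prior that feeds the CNN branch can be sketched as follows. This is a minimal pure-Python sketch that assumes the standard definition of the dark channel (the per-pixel minimum over the color channels, followed by a local minimum over a square patch). The function name `dark_channel`, the patch size, and the toy image are hypothetical choices made for this example and are not taken from the proposed model; a practical pipeline would likely use a vectorized array library instead of nested lists.

```python
def dark_channel(image, patch=3):
    """Return the dark channel of an RGB image given as nested lists.

    image[y][x] is an (r, g, b) tuple with values in [0, 1].
    patch is the odd side length of the square neighbourhood
    (an assumed parameter, not a value from the paper).
    """
    if patch < 1 or patch % 2 == 0:
        raise ValueError("patch must be a positive odd integer")

    height, width = len(image), len(image[0])
    radius = patch // 2

    # Step 1: minimum over the three color channels at every pixel.
    channel_min = [[min(pixel) for pixel in row] for row in image]

    # Step 2: minimum over the local patch, clipped at the image border.
    dark = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            dark[y][x] = min(channel_min[j][i] for j in ys for i in xs)
    return dark


# Toy 2x2 image (made-up values, for demonstration only).
hazy = [[(0.8, 0.7, 0.9), (0.6, 0.9, 0.8)],
        [(0.9, 0.8, 0.7), (0.7, 0.8, 0.9)]]
print(dark_channel(hazy, patch=1))
print(dark_channel(hazy, patch=3))
```

In haze-free outdoor scenes the dark channel tends to be close to zero, whereas haze raises it, which is why a map of this kind is a plausible auxiliary input for image-based air quality estimation.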