Air quality monitoring plays a vital role in the sustainable development of any country. Continuous monitoring of the major air pollutants and forecasting of their variations can help protect the environment and improve public health. However, the task is challenging because observations from on-ground instruments have limited spatial coverage. We propose a multimodal deep learning network (M²-APNet) to predict major air pollutants at a global scale from multimodal temporal satellite images. The inputs to M²-APNet include satellite images, a digital elevation model (DEM), and other key attributes. M²-APNet employs a convolutional neural network to extract local features and a bidirectional long short-term memory (LSTM) network to capture longitudinal features from the multimodal temporal data. These features are fused to uncover common patterns useful for regression of the major air pollutants and for categorization of the air quality index (AQI). We conducted extensive experiments to predict air pollutants and AQI across important regions in India using multiple temporal modalities. The experimental results demonstrate that the DEM modality is more effective than the others in learning to predict major air pollutants and in determining the AQI.
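
To make the AQI categorization target concrete, the following is a minimal sketch of how a numeric AQI value could be mapped to a category label. The abstract does not state which AQI scheme is used; because the experiments cover regions in India, the sketch assumes the six bands of India's National Air Quality Index. The function name `aqi_category` is hypothetical, and the code illustrates only the labeling step, not the network itself.

```python
from bisect import bisect_left

# Assumption: six bands of India's National Air Quality Index.
# Each entry is the inclusive upper bound of a band.
_UPPER_BOUNDS = [50, 100, 200, 300, 400, 500]
_LABELS = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]


def aqi_category(aqi: float) -> str:
    """Map a numeric AQI value to its assumed category label.

    Values above 500 are clamped to the last band ("Severe").
    Fractional values just above a bound fall into the next band.
    """
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    # bisect_left returns the index of the first bound >= aqi,
    # so a value equal to a bound stays in the lower band.
    idx = bisect_left(_UPPER_BOUNDS, aqi)
    return _LABELS[min(idx, len(_LABELS) - 1)]
```

Under these assumed bands, `aqi_category(50)` gives `"Good"` and `aqi_category(51)` gives `"Satisfactory"`. A classifier trained for the categorization task would predict one of these six labels directly.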