Urban activities, particularly vehicle traffic, contribute significantly to environmental pollution, with detrimental effects on public health. The ability to anticipate air quality is critical for public authorities and the general public to plan and manage these activities, which ultimately helps minimize the adverse impact on the environment and public health. Thanks to recent advancements in Artificial Intelligence and sensor technology, air quality can now be forecast from a range of environmental factors. This paper presents our novel solution for predicting air quality and analyzing its correlation with different environmental factors and urban activities, such as traffic density. To this end, we propose a multi-modal framework that integrates real-time data from different environmental sensors with traffic density extracted from Closed-Circuit Television (CCTV) footage. The framework addresses data inconsistencies arising from sensor and camera malfunctions within a streaming dataset that exhibits real-world complexities, including abrupt camera or station activations/deactivations, noise interference, and outliers. The proposed system tackles the challenge of predicting air quality at locations that have no sensors or are experiencing sensor failures by training a joint model on data obtained from nearby stations/sensors, using a Particle Swarm Optimization (PSO)-based merit fusion of the sensor data. The proposed methodology is evaluated using several variants of the LSTM model, including Bi-directional LSTM, CNN-LSTM, and Convolutional LSTM (ConvLSTM), obtaining improvements of 48%, 67%, and 173% over the ARIMA model for short-term, medium-term, and long-term periods, respectively.
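To illustrate the idea of PSO-based fusion of nearby sensor data, the following is a minimal sketch and not the implementation used in this work. It assumes that "merit" can be read as one non-negative weight per nearby station, with the weights summing to one and chosen to minimize the mean squared error against a reference series. The function names (`fuse`, `normalize`, `mse`, `pso_fusion_weights`), the PSO coefficients, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import random

def fuse(weights, readings):
    """Weighted sum of one time step's readings from K nearby stations."""
    return sum(w * r for w, r in zip(weights, readings))

def normalize(position):
    """Map a particle position to non-negative weights that sum to 1."""
    mags = [abs(p) + 1e-12 for p in position]
    total = sum(mags)
    return [m / total for m in mags]

def mse(position, neighbors, target):
    """Mean squared error of the fused series against the reference series."""
    w = normalize(position)
    errors = [(fuse(w, row) - y) ** 2 for row, y in zip(neighbors, target)]
    return sum(errors) / len(errors)

def pso_fusion_weights(neighbors, target, n_particles=20, n_iters=100, seed=0):
    """Search for fusion weights with a basic global-best PSO (assumed settings)."""
    rng = random.Random(seed)
    k = len(neighbors[0])
    inertia, c_personal, c_global = 0.7, 1.5, 1.5  # assumed, not from the paper

    pos = [[rng.random() for _ in range(k)] for _ in range(n_particles)]
    vel = [[0.0] * k for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_cost = [mse(p, neighbors, target) for p in pos]
    g = min(range(n_particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(k):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull toward personal and global bests.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            cost = mse(pos[i], neighbors, target)
            if cost < best_cost[i]:
                best_pos[i], best_cost[i] = pos[i][:], cost
                if cost < g_cost:
                    g_pos, g_cost = pos[i][:], cost
    return normalize(g_pos)

# Synthetic example: three nearby stations and a reference series that is a
# noisy weighted mix of them (the mixing weights are made up for this demo).
data_rng = random.Random(1)
true_w = [0.6, 0.3, 0.1]
neighbors = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(100)]
target = [fuse(true_w, row) + data_rng.gauss(0, 0.01) for row in neighbors]

weights = pso_fusion_weights(neighbors, target)
print("fusion weights:", [round(w, 3) for w in weights])
```

In this sketch the fused series could stand in for a missing or failed station when training a joint model. A real deployment would fit the weights on periods where the reference station is available and would first need to handle the noise, outliers, and station activations/deactivations described above.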