Nowadays, air pollution is a major problem with negative impacts on human health and on the environment. Air pollution forecasts can provide important information to all affected parties and allow appropriate measures to be taken. To address the problems of filling in missing values in the time series used for air pollution forecasting, automatically selecting an optimal subset of input variables, capturing the dependence of the air quality at a particular location on the conditions of the surrounding environment, and automating the optimization of the model, this paper proposes a deep spatiotemporal model based on a 2D convolutional neural network (CNN) and a long short-term memory (LSTM) network for predicting air pollution. The model relies on the automatic selection of input variables and the optimization of its hyperparameters by a genetic algorithm. Missing values are imputed with a hybrid strategy that combines linear interpolation with the average of the previous value and the mean value for the same time in other years. To determine the best architecture of the spatiotemporal model, the architecture hyperparameters are optimized by a genetic algorithm with a modified crossover operator for variable-length solutions. Additionally, the trained models are combined into ensembles to further improve the prediction performance: ensembles of models sharing the best architecture found by the evolutionary optimization, and ensembles of diverse models comprising the k best models from the evolutionary optimization. The experimental results for the Beijing Multi-Site Air-Quality Data Set show that the proposed spatiotemporal model provides accurate and consistent air pollution forecasts. In a comparison with other deep neural network models, the suggested model achieves the best MAE for the station at Wanliu (16.753 ± 0.384). Most of the architectures obtained by the genetic optimization of the hyperparameters have a single convolutional layer with a small number of kernels and a small kernel size, followed by a max-pooling layer and one or two LSTM layers with dropout regularization at small rates (p = 0.1, 0.2 and 0.3). Ensembles of the k best trained models further improve the prediction results and outperform the other deep learning models in terms of both MAE and RMSE. The hybrid strategy for missing value imputation enhances the results, especially for data with clear seasonality, and yields a lower MAE than the strategy that uses the average value for the same hour of the same day and month in other years. The experimental results also reveal that random search is a simple and effective strategy for selecting the input variables. Furthermore, including spatial information from the local neighborhood in the model's input data significantly improves its predictive performance. The results demonstrate the benefits of including spatial information from as many surrounding stations as possible, as well as using as much historical information as possible.
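To make the typical architecture found by the optimization more concrete, the following is a minimal sketch in Keras (an assumed framework, not necessarily the one used in the paper). The input layout (a sequence of 2D grids covering the neighboring stations), the grid size, the number of kernels, the kernel size, the LSTM width and the dropout rate are illustrative assumptions consistent with the description above, not values reported in the paper.

```python
from tensorflow.keras import layers, models

def build_cnn_lstm(timesteps=24, height=5, width=5, channels=6,
                   kernels=8, kernel_size=3, lstm_units=64, dropout_rate=0.2):
    """Minimal CNN-LSTM sketch: one small Conv2D layer and max pooling
    applied to every time step, followed by a single LSTM layer with a
    small dropout rate and a dense output for the one-step-ahead forecast."""
    inputs = layers.Input(shape=(timesteps, height, width, channels))
    # Extract spatial features from each hourly grid of neighboring stations.
    x = layers.TimeDistributed(
        layers.Conv2D(kernels, kernel_size, padding="same",
                      activation="relu"))(inputs)
    x = layers.TimeDistributed(layers.MaxPooling2D(pool_size=2))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    # Model the temporal dependencies; dropout is applied within the LSTM.
    x = layers.LSTM(lstm_units, dropout=dropout_rate)(x)
    outputs = layers.Dense(1)(x)  # predicted pollutant concentration
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae")
    return model
```

In such a setup, the genetic algorithm would vary arguments like `kernels`, `kernel_size`, `lstm_units`, the number of LSTM layers and `dropout_rate` to search the architecture space.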
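For concreteness, the following is a minimal sketch of how such a hybrid imputation could be implemented; it is not the authors' code. It assumes a datetime-indexed hourly pandas Series, a hypothetical gap-length threshold `max_gap` that decides when to switch from interpolation to the seasonal-average rule, and a seasonal mean keyed on (month, day, hour) computed over all available years.

```python
import pandas as pd

def hybrid_impute(series: pd.Series, max_gap: int = 3) -> pd.Series:
    """Fill gaps in an hourly, datetime-indexed series.

    Short gaps (up to `max_gap` consecutive missing hours) are filled by
    linear interpolation; longer gaps use the average of the last observed
    value and the mean value recorded at the same calendar time (month,
    day, hour) in the available years.
    """
    s = series.copy()

    # Mean value for the same (month, day, hour) across all available years.
    seasonal_mean = s.groupby(
        [s.index.month, s.index.day, s.index.hour]).transform("mean")

    # Label runs of consecutive missing values and measure their lengths.
    is_na = s.isna()
    gap_id = (is_na != is_na.shift()).cumsum()
    gap_len = is_na.groupby(gap_id).transform("sum")

    # 1) Linear interpolation for short gaps.
    short = is_na & (gap_len <= max_gap)
    s[short] = s.interpolate(method="linear", limit_direction="both")[short]

    # 2) Longer gaps: average of the previous observed value and the
    #    seasonal mean for the same time in other years.
    long_gap = s.isna()
    s[long_gap] = (s.ffill()[long_gap] + seasonal_mean[long_gap]) / 2.0

    # Anything still missing (e.g. a gap at the very start of the series)
    # falls back to the seasonal mean alone.
    return s.fillna(seasonal_mean)
```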