As air traffic grows, air traffic controllers (ATCs) work under high load, or even overload, for long periods, which can seriously degrade the reliability and efficiency of their commands. Identifying overloaded ATCs early is therefore crucial to maintaining flight safety while improving overall flight efficiency. After comparing existing cognitive load assessment methods against the characteristics of air traffic control work, this study proposes a method that assesses cognitive load from speech parameters. Speech is chosen because the collection equipment causes minimal interference and because speech signals are abundant. The speech signal is pre-processed to generate a Mel spectrogram, which contains temporal information in addition to energy, tone, and other spatial information. To exploit both kinds of information, we propose a speech cognitive load evaluation model based on a stacked convolutional neural network (CNN) and the Transformer encoder (SCNN-TransE). The CNN extracts spatial features and the Transformer encoder extracts temporal features from the contextual information in the speech data; fusing the two into spatio-temporal features improves the model’s ability to capture deep features of speech. Experiments on air traffic control communication data show that SCNN-TransE reaches a detection accuracy of 97.48% and an F1 score of 97.07%, outperforming the support-vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), adaptive boosting (AdaBoost), and stacked CNN parallel long short-term memory with attention (SCNN-LSTM-Attention) models. The proposed model can thus effectively evaluate cognitive load levels.
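To make the pre-processing step concrete, the following is a minimal sketch of how a log-Mel spectrogram can be computed from a speech signal, using only the Python standard library. It is an illustration under stated assumptions, not the pipeline used in this study: the Hz-to-mel formula (the common 2595·log10(1 + f/700) variant), the Hann window, the triangular filters, and all parameter values (`n_fft`, `hop`, `n_mels`) and function names are choices made for this example. Each output row is one time frame and each column one Mel band, which is why the representation carries both temporal and spectral information.

```python
import cmath
import math


def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the study does not state its variant).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sr / 2.0)
    # n_mels + 2 points: left edge, n_mels centers, right edge.
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sr / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame via a naive DFT (slow but dependency-free)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_mel_spectrogram(signal, sr, n_fft=64, hop=32, n_mels=10):
    """Return a list of frames; each frame is a list of n_mels log energies."""
    bank = mel_filterbank(n_mels, n_fft, sr)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        spec = power_spectrum(signal[start:start + n_fft])
        mel_energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        frames.append([math.log(e + 1e-10) for e in mel_energies])
    return frames


# Usage: a synthetic 1 kHz tone stands in for a speech recording.
sr = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / sr) for n in range(256)]
spectrogram = log_mel_spectrogram(tone, sr)
print(len(spectrogram), "frames x", len(spectrogram[0]), "mel bands")
```

In practice an FFT-based library routine would replace the naive DFT, and the resulting time-by-band matrix would be the input to a model such as the one described above.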