In this work, an end-to-end framework is proposed for multilingual automatic speech recognition (ASR) in air traffic control (ATC) systems. Considering the standard ATC procedure, a recurrent neural network (RNN) based framework is selected to mine the temporal dependencies among speech frames. To cope with the distributed feature space caused by radio transmission, a hybrid feature embedding block is designed to extract high-level representations, in which multiple convolutional neural networks accommodate different frequency and temporal resolutions. A residual mechanism is applied to the RNN layers to improve trainability and convergence. To integrate multilingual ASR into a single model and relieve class imbalance, a special vocabulary, the pronunciation-oriented vocabulary, is designed to unify the pronunciation of the Chinese and English vocabulary. The proposed model is optimized with the connectionist temporal classification (CTC) loss and validated on a real-world speech corpus (ATC-Speech). Character error rates of 4.4% and 5.9% are achieved for Chinese and English speech, respectively, outperforming other popular approaches. Most importantly, the proposed approach accomplishes the multilingual ASR task in an end-to-end manner with considerably high performance.
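For readers unfamiliar with the reported metric, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein (edit) distance between reference and recognized transcripts, summed over the corpus and divided by the total number of reference characters. The function names, the corpus-level pooling, and the treatment of spaces as ordinary characters are assumptions made for illustration; this section does not specify the exact scoring convention used for ATC-Speech.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling DP row."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters.

    Assumption (not from the paper): errors are pooled over all utterances
    and every character, including spaces, counts toward the total.
    """
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    refs = ["climb to eight thousand", "contact tower"]
    hyps = ["climb to eight thousand", "contact power"]
    print(f"CER = {character_error_rate(refs, hyps):.3f}")
```

Pooling errors over the corpus, rather than averaging per-utterance rates, keeps short instructions from dominating the score; either convention is plausible here and the choice above is only one option.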
INTRODUCTION
Air traffic control (ATC) is an essential service in which ground-based air traffic controllers (ATCOs) guide flights to operate safely (i.e., to prevent conflicts) and organize and expedite the traffic flow. As the primary means of communication between the ATCO and the aircrew, spoken instructions transmitted over very high frequency (VHF) radio carry a wealth of contextualized situational information, which is important for real-time ATC decision-making. In the current ATC management system, ATC is a non-automatic, human-in-the-loop procedure and is always regarded as a potential risk to air traffic operation [1]. Numerous studies have demonstrated that monitoring the control conversation is a promising way to obtain real-time traffic dynamics [2,3,4], which helps to form a closed-loop ATC management system. To this end, the automatic speech recognition (ASR) technique, with the purpose of building the bridge

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.