Vehicle detection in aerial thermal infrared images has received significant attention because such imagery supports day-and-night observation and supplies information for vehicle tracking, traffic monitoring, and road network planning. Unlike aerial visible images, aerial thermal infrared images are insensitive to lighting conditions; however, they suffer from low contrast and blurred edges. We therefore propose a combinational and sparse you-only-look-once (ComS-YOLO) neural network to detect vehicles in aerial thermal infrared images accurately and quickly. First, we adjust the structure of the deep neural network to balance detection accuracy against running time. Second, we propose an objective function that uses the diagonal distance of the corresponding minimum bounding rectangle, which prevents non-convergence when one of the predicted and ground-truth boxes contains the other or when their widths and heights are aligned. Third, to avoid over-fitting during training, we eliminate redundant parameters through constraints and online pruning. Experimental results on the NWPU VHR-10 and DARPA VIVID datasets show that ComS-YOLO identifies vehicles effectively and efficiently, with low miss and false detection rates. Compared with Faster R-CNN and a series of YOLO networks, the proposed network achieves competitive detection accuracy and running time. Vehicle detection experiments under different environments further show that the method remains accurate and robust.
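To make the role of the diagonal distance concrete, the following minimal sketch shows a DIoU-style box regression loss. This is a well-known loss that shares the ingredient named above, not the ComS-YOLO objective itself, whose exact form is not given here. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` format and takes the minimum bounding rectangle to be the smallest rectangle enclosing both boxes; the function name `diou_style_loss` is hypothetical.

```python
def diou_style_loss(pred, true):
    """DIoU-style loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    Illustrative only: assumes x1 < x2 and y1 < y2 for both boxes, and is
    not claimed to match the objective function used by ComS-YOLO.
    """
    # Intersection area (zero when the boxes do not overlap).
    ix1, iy1 = max(pred[0], true[0]), max(pred[1], true[1])
    ix2, iy2 = min(pred[2], true[2]), min(pred[3], true[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area and IoU.
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_true = (true[2] - true[0]) * (true[3] - true[1])
    iou = inter / (area_pred + area_true - inter)

    # Squared diagonal of the smallest rectangle enclosing both boxes.
    ex1, ey1 = min(pred[0], true[0]), min(pred[1], true[1])
    ex2, ey2 = max(pred[2], true[2]), max(pred[3], true[3])
    diag_sq = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2

    # Squared distance between the two box centres.
    dx = (pred[0] + pred[2]) / 2 - (true[0] + true[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (true[1] + true[3]) / 2
    centre_sq = dx ** 2 + dy ** 2

    # IoU term plus centre distance normalised by the enclosing diagonal.
    return 1.0 - iou + centre_sq / diag_sq


true_box = (0.0, 0.0, 4.0, 4.0)
centred = (1.0, 1.0, 3.0, 3.0)   # inside true_box, same centre
cornered = (0.0, 0.0, 2.0, 2.0)  # inside true_box, shifted to a corner
loss_centred = diou_style_loss(centred, true_box)
loss_cornered = diou_style_loss(cornered, true_box)
```

Both predicted boxes lie entirely inside the ground-truth box and have the same IoU, so an IoU-only loss cannot tell them apart. The normalised centre-distance term still differs between them, which gives the optimiser a direction to move in for the inclusion case.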