Inferring gaze targets or gaze-following behavior is an effective way to understand human behavior and intentions. To address errors caused by the eye region disappearing from the image or being occluded when the head turns, this paper adopts a non-intrusive, appearance-based tracking technique that uses a binocular stereo vision camera to capture the face image and head pose, estimating the gaze direction from a single image frame. Furthermore, to handle head motion and viewing direction effectively and to improve classification and detection of the gaze target region, this paper proposes a hybrid-structure Swin Transformer method for gaze target region classification. First, a ResNet50 model and a Swin Transformer model are employed to extract facial image features; these features are then fused with head pose features to classify the gaze target region. The classification performance of different model structures is also compared. The results show that the hybrid Swin Transformer model is more effective at classifying and detecting the gaze target region, achieving an accuracy of 90%. Finally, the paper analyzes the gaze of flight trainees during flight missions using a heatmap, which provides a foundation for subsequent analyses of pilot attention and operational intent during flight.

INDEX TERMS Gaze estimation, Swin Transformer, computer vision, region classification.
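A minimal sketch of the kind of hybrid model the abstract describes, assuming a PyTorch implementation: facial features from a ResNet50 branch and a Swin Transformer branch are concatenated with a head-pose vector (assumed here to be yaw/pitch/roll) and passed to a small classification head over gaze target regions. The class name, region count, and fusion head are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import timm
from torchvision.models import resnet50


class HybridGazeRegionClassifier(nn.Module):
    """Hypothetical hybrid CNN + Swin Transformer gaze-target-region classifier."""

    def __init__(self, num_regions: int = 9, pose_dim: int = 3):
        super().__init__()
        # CNN branch: ResNet50 backbone with the classification head removed (2048-d features).
        self.cnn = resnet50(weights=None)
        self.cnn.fc = nn.Identity()
        # Transformer branch: Swin-Tiny backbone returning pooled features only.
        self.swin = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False, num_classes=0
        )
        # Fusion head: concatenate both image features with the head-pose vector
        # and classify the gaze target region.
        fused_dim = 2048 + self.swin.num_features + pose_dim
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.2),
            nn.Linear(512, num_regions),
        )

    def forward(self, face: torch.Tensor, head_pose: torch.Tensor) -> torch.Tensor:
        f_cnn = self.cnn(face)    # (B, 2048) CNN facial features
        f_swin = self.swin(face)  # (B, swin.num_features) transformer facial features
        fused = torch.cat([f_cnn, f_swin, head_pose], dim=1)
        return self.head(fused)   # (B, num_regions) region logits


# Usage: one 224x224 face crop plus a head-pose estimate per sample.
model = HybridGazeRegionClassifier(num_regions=9)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3))
print(logits.shape)  # torch.Size([2, 9])
```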