In recent years, drones have become a fundamental tool for searching for missing persons in the wild and at sea by collecting aerial images. Search and Rescue operators inspect these images in real time, aiming to spot the missing persons. However, there is a tradeoff: flying high yields wide-area images that cover extensive regions quickly but risk missing the target, whereas flying at a lower altitude makes detection easier but requires processing a larger set of images. This work addresses automatic person detection in aerial images. We propose TPH-YOLOv7t, a new Transformer Prediction Head for YOLOv7-tiny with an Efficient Joint Attention Module and a Convolutional Block Attention Module, evaluated using the holdout method. Results demonstrate our method’s robustness compared with YOLOv7-tiny: it achieves a mean average precision (mAP50) of over 0.65 and an inference speed of 130 fps on a single Nvidia RTX 2080 Ti GPU.
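To make the altitude tradeoff concrete, the following minimal sketch assumes a nadir-pointing pinhole camera over flat terrain with no overlap between images. The function names, field-of-view values, image width, and person width are hypothetical illustration values, not parameters taken from this work.

```python
import math


def footprint_m(altitude_m, hfov_deg, vfov_deg):
    """Ground footprint (width, height) in metres of one nadir image.

    Assumes a pinhole camera pointing straight down over flat ground.
    """
    width = 2 * altitude_m * math.tan(math.radians(hfov_deg) / 2)
    height = 2 * altitude_m * math.tan(math.radians(vfov_deg) / 2)
    return width, height


def images_needed(area_m2, altitude_m, hfov_deg, vfov_deg):
    """Number of non-overlapping images needed to tile a search area."""
    width, height = footprint_m(altitude_m, hfov_deg, vfov_deg)
    return math.ceil(area_m2 / (width * height))


def person_width_px(altitude_m, hfov_deg, image_width_px, person_width_m=0.5):
    """Approximate on-image width, in pixels, of a person seen from above."""
    ground_width_m, _ = footprint_m(altitude_m, hfov_deg, hfov_deg)
    return person_width_m * image_width_px / ground_width_m


if __name__ == "__main__":
    # Hypothetical camera: 60 x 45 degree field of view, 1920 px wide images,
    # searching one square kilometre.
    for altitude in (30, 60, 120):
        n = images_needed(1_000_000, altitude, 60, 45)
        px = person_width_px(altitude, 60, 1920)
        print(f"altitude {altitude:>3} m: {n:>4} images, person ~{px:.1f} px wide")
```

Under these assumptions the footprint area grows with the square of the altitude, so the number of images falls quadratically as the drone climbs, while the apparent size of a person shrinks only linearly. This is the tension between coverage speed and detectability that the detector has to cope with.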