With the remarkable advances in unmanned aerial vehicles (UAVs) and machine vision, aerial tracking has attracted wide attention from researchers. Previous tracking methods were mostly developed for clean, well-lit scenes, so tracking camouflaged people rapidly and accurately in woodlands remains challenging. We develop a transformer-based framework for camouflaged people aerial tracking (CPAT). Specifically, a camouflaged people discovery strategy is proposed to rapidly generate training samples from the unlabeled videos captured by the UAV. Dynamic programming is also employed to filter noise and generate smooth candidate frames. To exploit multilevel feature information, a transformer fusion framework is designed to integrate shallow spatial information with deep semantic features. To reduce computational cost, a spatial attention reduction mechanism is embedded in the multihead attention for fast tracking. Further, we build a dataset for evaluating camouflaged people tracking, called Cam235, which consists of 85 manually labeled test sequences and an unlabeled training set of more than 100k frames. Extensive experiments on Cam235-test and popular tracking datasets show that CPAT outperforms other trackers in practical application. Under the most challenging condition of camouflaged people tracking, CPAT achieves a precision of 67.9%, surpassing state-of-the-art trackers by large margins.
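The abstract does not specify the cost function behind the dynamic-programming noise filter, so the following is only a minimal sketch of how such a filter might work, under stated assumptions. It assumes each frame yields a few candidate detections given as `(cx, cy, score)` tuples and selects one candidate per frame by maximizing the summed scores minus a penalty on center displacement between consecutive frames. The function name `smooth_track`, the parameter `jump_weight`, and the candidate format are hypothetical and do not come from the original work.

```python
from math import hypot


def smooth_track(candidates, jump_weight=1.0):
    """Pick one candidate per frame with Viterbi-style dynamic programming.

    candidates: list of frames; each frame is a non-empty list of
                (cx, cy, score) tuples (hypothetical format).
    jump_weight: assumed penalty per pixel of center displacement.
    Returns the index of the chosen candidate in every frame.
    """
    if not candidates or any(not frame for frame in candidates):
        raise ValueError("every frame needs at least one candidate")

    # best[j]: best accumulated objective of a path ending at candidate j
    best = [score for _, _, score in candidates[0]]
    back = []  # back[t][j]: index chosen in frame t for candidate j in frame t+1

    for prev_frame, frame in zip(candidates, candidates[1:]):
        new_best, pointers = [], []
        for cx, cy, score in frame:
            # Objective of reaching this candidate from each previous one.
            options = [
                best[i] - jump_weight * hypot(cx - px, cy - py)
                for i, (px, py, _) in enumerate(prev_frame)
            ]
            i_best = max(range(len(options)), key=options.__getitem__)
            new_best.append(options[i_best] + score)
            pointers.append(i_best)
        best = new_best
        back.append(pointers)

    # Backtrack from the best final candidate to recover the whole path.
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path


if __name__ == "__main__":
    frames = [
        [(10, 10, 0.9)],
        [(11, 10, 0.6), (80, 80, 0.95)],  # second candidate is a far-away outlier
        [(12, 11, 0.8)],
    ]
    print(smooth_track(frames, jump_weight=0.05))
```

In this toy input the outlier in the middle frame has the highest score, yet the displacement penalty steers the path toward the spatially consistent candidate. Setting `jump_weight=0.0` removes the smoothness term, so the selection falls back to picking the highest-scoring candidate in each frame.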