Small-scale, low-altitude unmanned aerial vehicles (UAVs) capable of perceiving military targets will become increasingly essential for strategic reconnaissance and stationary patrols. To address challenges such as complex terrain, weather variations, and the deception and camouflage of military targets, this paper proposes a hybrid detection model that combines a Convolutional Neural Network (CNN) and a Transformer architecture in a decoupled manner. The proposed detector consists of a C-branch and a T-branch. The C-branch introduces the Multi-gradient Path Network (MgpNet), which is inspired by the multi-gradient flow strategy and excels at capturing the local feature information of an image. The T-branch introduces RPFormer, built on a Region–Pixel two-stage attention mechanism, to aggregate the global feature information of the whole image. A feature fusion strategy is proposed to merge the feature layers of the two branches, further improving detection accuracy. Furthermore, to better simulate real UAV reconnaissance environments, we construct a dataset of military targets in complex environments, captured from an oblique perspective, and use it to evaluate the proposed detector. In ablation experiments, different fusion methods are compared, and the results demonstrate the effectiveness of the proposed fusion strategy. In comparative experiments, the proposed detector outperforms the majority of advanced general-purpose detectors.
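To make the idea of a Region–Pixel two-stage attention more concrete, the following is a minimal sketch of one possible reading of it, not the RPFormer implementation itself. It assumes that the first stage scores regions against each other using their mean feature vectors and keeps the `top_k` most similar ones, and that the second stage lets each pixel attend only to pixels in the kept regions. The function name `region_pixel_attention`, the `top_k` parameter, and the use of raw features as queries, keys, and values (no learned projections) are all illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def region_pixel_attention(regions, top_k=2):
    """Hypothetical two-stage attention sketch.

    regions: list of regions, each a list of pixel feature vectors.
    Returns the same nested structure with attended pixel features.
    Assumption: queries, keys, and values are the raw features.
    """
    dim = len(regions[0][0])
    # Stage 1 (region level): summarise each region by its mean feature.
    means = [
        [sum(p[d] for p in region) / len(region) for d in range(dim)]
        for region in regions
    ]
    scale = math.sqrt(dim)
    out = []
    for i, region in enumerate(regions):
        # Keep the top_k regions most similar to region i.
        scores = [dot(means[i], m) for m in means]
        keep = sorted(range(len(regions)), key=lambda j: scores[j],
                      reverse=True)[:top_k]
        # Stage 2 (pixel level): attend only to pixels of kept regions.
        keys = [p for j in keep for p in regions[j]]
        attended = []
        for q in region:
            w = softmax([dot(q, k) / scale for k in keys])
            attended.append(
                [sum(wi * k[d] for wi, k in zip(w, keys)) for d in range(dim)]
            )
        out.append(attended)
    return out
```

Under these assumptions, restricting the pixel-level step to a few selected regions keeps the cost well below full pixel-to-pixel attention while still letting distant but related regions exchange information. A learned model would add projection matrices, multiple heads, and a fusion step with the CNN branch, none of which is shown here.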