In aerial image object detection, a key challenge is to efficiently detect objects of different sizes in input images of different scales and to obtain a unified multi-scale representation of each object. Existing methods rarely consider the connection between multi-scale training and multi-scale inference, and they do not adequately constrain the input object samples used during multi-scale training, which limits the quality of the multi-scale representation. In this study, we propose an efficient object detection algorithm for aerial images to alleviate this problem. First, we use metric learning to obtain the scale representation boundary of each object class, which reduces the support for indistinguishable objects at extreme scales during training and strengthens the multi-scale representation. Second, indistinguishable small objects are merged into small-object regions, and these regions are learned so that they direct the detector to the small objects at the next, higher-resolution scale. This establishes a reasonable association between multi-scale training and inference and considerably improves the efficiency of multi-scale inference. The proposed algorithm is evaluated on three popular aerial image datasets: VisDrone, DOTA, and UAVDT. Experimental results show that it improves detection accuracy while reducing the number of pixels processed.

INDEX TERMS Aerial images, extreme scale, metric learning, object detection.