Multi-scale object detection is a fundamental challenge in computer vision. Although many advanced methods based on convolutional neural networks have succeeded on natural images, progress on aerial images has been relatively slow, mainly due to the huge scale variations of objects and the many densely distributed small objects. In this paper, considering that the semantic information of small objects may be weakened or even disappear in the deeper layers of a neural network, we propose a new detection framework called the Extended Feature Pyramid Network (EFPN) to strengthen the information-extraction ability of the network. In the EFPN, we first design a multi-branched dilated bottleneck (MBDB) module in the lateral connections to capture much richer semantic information. We then devise an attention pathway for better locating objects. Finally, an augmented bottom-up pathway is added so that shallow-layer information spreads more easily, further improving performance. Moreover, we present an adaptive scale training strategy that enables the network to better recognize multi-scale objects. Meanwhile, we present a novel clustering method that produces adaptive anchors and helps the network better learn the data features. Experiments on public aerial datasets indicate that the presented method obtains state-of-the-art performance.

Remote Sens. 2020, 12, 784

The FPN adopts a top-down architecture with lateral connections to build feature maps with high-level semantic information at each scale. This structure yields a clear improvement as a common feature extractor in several practical applications. However, since large-scale objects are usually produced and predicted in the deeper convolution layers of the FPN, the boundaries of these objects may be too fuzzy for accurate regression. Furthermore, the FPN usually predicts small-scale objects in the shallower layers, whose low-level semantic information may not be sufficient to identify the class of the objects.
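The paper's novel clustering method for adaptive anchors is not detailed in this excerpt. As a point of reference, a widely used baseline (popularized by YOLOv2) clusters ground-truth box widths and heights with k-means under a 1 − IoU distance, so that the learned anchors match the dataset's box-shape distribution. The sketch below is that generic baseline with illustrative names, not the authors' implementation:

```python
import numpy as np

def iou_wh(boxes, anchors):
    # IoU between (w, h) pairs, assuming boxes and anchors share a top-left corner.
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    areas_b = boxes[:, 0] * boxes[:, 1]
    areas_a = anchors[:, 0] * anchors[:, 1]
    union = areas_b[:, None] + areas_a[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    # boxes: (N, 2) array of ground-truth (width, height) pairs.
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each box to the anchor with the highest IoU (lowest 1 - IoU).
        assign = iou_wh(boxes, anchors).argmax(axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors
```

Using 1 − IoU instead of Euclidean distance prevents large boxes from dominating the clustering, since IoU is scale-relative.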
The designers of the FPN were aware of this problem and adopted a top-down structure with lateral connections that fuses shallow layers with high-level semantic information to alleviate it. However, if small-scale objects have already vanished in the deep convolution layers, their contextual cues vanish with them.
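The top-down fusion with lateral connections described above can be sketched as follows. This is a generic FPN-style merge in NumPy (nearest-neighbour upsampling, identity lateral projections, all levels assumed to share a channel count), not the EFPN itself:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(features, laterals):
    # features: backbone maps ordered shallow -> deep, e.g. [C3, C4, C5].
    # laterals: one projection per level (1x1 convs in a real FPN; callables here).
    p = laterals[-1](features[-1])          # start from the deepest, most semantic map
    outs = [p]
    for f, lat in zip(reversed(features[:-1]), reversed(laterals[:-1])):
        # Lateral connection: add the projected shallow map to the
        # upsampled deeper map, injecting semantics into high resolution.
        p = lat(f) + upsample2x(p)
        outs.append(p)
    return outs[::-1]                       # [P3, P4, P5], shallow -> deep
```

The weakness discussed in the text is visible here: if a small object's activation is already absent from the deepest map, the top-down addition can only propagate what survives, so nothing is recovered for that object.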