Object detection in high-resolution remote sensing images plays an increasingly important role in image processing and interpretation. Region-based convolutional neural networks (R-CNNs) have greatly improved object detection performance. However, characteristics of remote sensing images such as very large image size, similar backgrounds, and an imbalanced distribution of categories make this task more challenging. Previous work has focused on extracting multi-scale features of region proposals, often ignoring the quality of the regions of interest (ROIs). In this work, we propose a patch-based three-stage aggregation network (PTAN) for object detection in high-resolution remote sensing images. It consists of a three-stage cascade structure that progressively improves the quality of candidate regions by raising the intersection-over-union (IoU) threshold from stage to stage, and it adopts a resampling strategy to obtain sufficient region proposals. We also propose a patch-based strategy and apply it to the framework during both training and inference. Ablation experiments and comprehensive evaluations on the public remote sensing object detection dataset DOTA demonstrate the effectiveness and robustness of the proposed framework, which obtains a mean average precision (mAP) of 0.7958 on the validation set and a top-ranking mAP of 0.7858 on the test set. On another remote sensing object detection dataset, NWPU VHR-10, the proposed PTAN obtains an mAP of 0.9187, outperforming five other object detectors.
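To illustrate why a cascade with rising IoU thresholds benefits from resampling, the following minimal sketch counts how many proposals qualify as positives at each stage. It is not the PTAN implementation: the threshold values (0.5, 0.6, 0.7), the function names, and the toy boxes are assumptions chosen for illustration, and the per-stage box refinement that a real cascade performs is omitted.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positives_per_stage(proposals, gt_box, thresholds=(0.5, 0.6, 0.7)):
    """Return, for each stage, the proposals whose IoU with gt_box meets
    that stage's threshold. The thresholds are hypothetical values."""
    return [[p for p in proposals if iou(p, gt_box) >= t] for t in thresholds]


# Toy data (assumed): one ground-truth box and five proposals of varying quality.
gt = (0, 0, 10, 10)
proposals = [
    (0, 0, 10, 10),   # IoU 1.00
    (0, 0, 10, 8),    # IoU 0.80
    (0, 0, 10, 6.5),  # IoU 0.65
    (0, 0, 10, 5.5),  # IoU 0.55
    (0, 0, 10, 4),    # IoU 0.40
]

stage_counts = [len(s) for s in positives_per_stage(proposals, gt)]
print(stage_counts)  # [4, 3, 2]
```

The number of positives shrinks as the threshold rises, which is the situation a resampling strategy is meant to address: without it, later stages would be trained on too few high-quality proposals.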