Object detection on fused visible and infrared images is important for many applications, such as surveillance and rescue in low-light conditions. However, current detectors struggle to detect objects robustly in fused images, mainly for two reasons. First, objects appear in various shapes and sizes, so some hard samples cannot be localized accurately. Second, objects of the same category can look different in fused images because of changing weather conditions, temperature, and intrinsic heat. This inconsistency degrades the classification task of a detection network, since the network cannot merge commonalities and distinguish differences well. In this paper, we propose to restructure the detection pipeline of current detectors to enhance their detection ability on difficult samples in fused images. Specifically, a Dilation Pyramid Network (DPN) is designed at the lateral connections to generate and aggregate features of various receptive fields without increasing the number of pyramid layers. To strengthen classification, a Semantic Category Attention Module (SCAM) is proposed to capture the attention centers of semantics in fused images, rather than object centers. Extensive experiments on two fusion datasets show that the proposed method achieves satisfactory performance, and that both modules can greatly improve current generic detectors on fused images.
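To illustrate the idea behind aggregating features of various receptive fields, the following is a minimal sketch and not the DPN itself. It assumes parallel dilated convolutions that share one kernel, works on a 1D signal instead of 2D feature maps, and aggregates branches by element-wise summation. The function names, the dilation rates (1, 2, 3), and the aggregation rule are all hypothetical choices made for illustration.

```python
def effective_kernel(k, d):
    """Span covered by a kernel of size k with dilation rate d."""
    return k + (k - 1) * (d - 1)


def dilated_conv1d(x, w, d):
    """Zero-padded, same-length 1D dilated convolution (cross-correlation)."""
    k = len(w)
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j in range(k):
            idx = i + (j - k // 2) * d  # taps are spaced d samples apart
            if 0 <= idx < len(x):       # positions outside x count as zero
                acc += w[j] * x[idx]
        out.append(acc)
    return out


def aggregate_branches(x, w, dilations):
    """Run one branch per dilation rate and sum the outputs element-wise.

    Summation is an assumed aggregation rule chosen for illustration.
    """
    branches = [dilated_conv1d(x, w, d) for d in dilations]
    return [sum(vals) for vals in zip(*branches)]


if __name__ == "__main__":
    impulse = [0, 0, 0, 1, 0, 0, 0]
    kernel = [1, 1, 1]
    for d in (1, 2, 3):
        print(d, effective_kernel(len(kernel), d))
    print(aggregate_branches(impulse, kernel, (1, 2, 3)))
```

With a size-3 kernel, the three branches cover spans of 3, 5, and 7 samples, so the aggregated response to a single impulse reaches across the whole 7-sample input while the output keeps the input length. This is the sense in which receptive fields can be varied without adding pyramid layers.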