Object detection is a critical and demanding task in the processing of satellite and airborne images. Targets in remote sensing imagery vary widely in size and appear against complicated backgrounds, which makes object detection extremely challenging. We address these issues in this paper by introducing MashFormer, an innovative multi-scale aware hybrid detector that integrates a CNN and a Transformer. Specifically, MashFormer employs a transformer block to complement the convolutional neural network (CNN) based feature extraction backbone, capturing the relationships between long-range features and enhancing the representational ability in complex background scenarios. To improve detection performance for multi-scale objects, whose sizes vary greatly in remote sensing scenarios, a multi-level feature aggregation component incorporating a cross-level feature alignment module is designed to alleviate the semantic discrepancy between features from shallow and deep layers. To verify the effectiveness of the proposed MashFormer, comparative experiments are carried out against other cutting-edge methods on the publicly available High Resolution Remote Sensing Detection (HRRSD) and Northwestern Polytechnical University (NWPU) VHR-10 datasets. The experimental results confirm the effectiveness and superiority of the proposed model, showing that our approach achieves higher mean average precision (mAP) than the other methods.
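For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean average precision is commonly computed: per-class average precision (AP) is the area under the precision-recall curve of the ranked detections, and mAP is the mean of the per-class values. The abstract does not state the evaluation protocol, so the all-point interpolation used here, the function names `average_precision` and `mean_average_precision`, and the assumption that each detection has already been labelled as a true or false positive (for example by an IoU threshold) are illustrative choices, not the authors' evaluation code.

```python
from typing import Sequence


def average_precision(
    scores: Sequence[float],
    is_true_positive: Sequence[bool],
    num_ground_truth: int,
) -> float:
    """AP for one class via all-point interpolation (assumed protocol)."""
    if num_ground_truth == 0:
        return 0.0
    # Rank detections by descending confidence score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    recalls, precisions = [], []
    tp = fp = 0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision with the maximum precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the rectangle areas wherever recall increases.
    ap = 0.0
    prev_recall = 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap: Sequence[float]) -> float:
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap) if per_class_ap else 0.0
```

In this sketch, a class with two ground-truth objects and three ranked detections labelled true, false, true yields an AP of 5/6; averaging such values over all object categories of a dataset gives the mAP figure used to compare detectors.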