In deep-learning-based vision tasks, strengthening multiscale representation by combining shallow and deep features has consistently improved performance across a wide range of applications. However, shallow and deep features often differ substantially in both scale and semantic content, which complicates their fusion. Most existing approaches represent multiscale features with standard convolutional structures, which may not fully capture the complexity of the underlying data. To address this, we propose a novel deep-multiscale stratified aggregation (D-MSA) module that improves the extraction and fusion of multiscale features by efficiently aggregating features across multiple receptive fields. We integrate the D-MSA module into the YOLO architecture to enhance its capacity for processing complex multiscale features. Experiments on the PASCAL VOC 2012 dataset demonstrate that D-MSA effectively handles complex multiscale features while improving computational efficiency, making it suitable for object detection in challenging environments.
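The abstract does not describe the internal structure of D-MSA, so the following is only a toy sketch of the general idea of aggregating features across multiple receptive fields, not the proposed module. It assumes a 1-D signal, parallel dilated convolutions that share one kernel, and fusion by simple averaging; the function names (`dilated_conv1d`, `receptive_field`, `aggregate_multi_rf`) and all parameters are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Same-length 1-D cross-correlation with zero padding and the given dilation."""
    k = len(kernel)
    assert k % 2 == 1, "an odd kernel size keeps the output centred"
    half = k // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Each tap is spaced `dilation` samples apart around position i.
            idx = i + (j - half) * dilation
            if 0 <= idx < len(signal):  # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def aggregate_multi_rf(signal: Sequence[float], kernel: Sequence[float],
                       dilations: Sequence[int]) -> List[float]:
    """Run one branch per dilation rate and fuse the branches by averaging.

    Averaging is an assumption made for this sketch; a real module might
    concatenate branches or learn the fusion weights instead.
    """
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    n = len(branches)
    return [sum(vals) / n for vals in zip(*branches)]
```

For example, calling `aggregate_multi_rf(signal, [1.0, 1.0, 1.0], [1, 2, 3])` on a unit impulse spreads the response over receptive fields of 3, 5, and 7 samples (as given by `receptive_field`), so the fused output responds to context at several scales at once while each branch keeps a three-tap kernel.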