“…The back-end network is composed of three branches, each containing dilated convolutions with a different expansion factor: 1, 2, and 4. The branch with an expansion factor of 1 captures the features of small-scale objects, while the other two branches enlarge the receptive field to capture the features of large-scale objects. As noted in [17], independent branches have difficulty learning the characteristics of different patterns, which leads to parameter redundancy. Therefore, in this paper, the feature maps of the branches are concatenated at each layer, and a 1 × 1 convolution performs cross-channel feature aggregation to strengthen the information interaction between the branches. This exploits the complementarity of the features extracted by each branch, so that the output feature map has greater expressive power and scale diversity.…”
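
The branch-and-fuse idea in this passage can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the kernel size of 3, the stride of 1, the 1D single-channel setting, the placeholder weights, and all function names are assumptions made for illustration. Only the expansion factors 1, 2, and 4, the per-layer concatenation, and the 1 × 1 aggregation come from the text. The sketch shows how the span covered by a kernel grows with the expansion factor, and why padding each branch to the same length is needed before the branch outputs can be concatenated and mixed.

```python
# Sketch only: kernel size 3, stride 1, 1D signals, and the weights below
# are assumptions for illustration; they are not stated in the source text.

def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size: int, dilation: int) -> int:
    """Per-side zero padding that preserves length at stride 1 (odd kernels)."""
    return dilation * (kernel_size - 1) // 2


def dilated_conv1d(signal, weights, dilation):
    """Single-channel 1D dilated cross-correlation with 'same' zero padding."""
    k = len(weights)
    pad = same_padding(k, dilation)
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(weights[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def fuse_1x1(branch_maps, mix):
    """1x1 convolution over concatenated branch channels.

    At every position, the output is a weighted sum across channels.
    A single output channel is produced here to keep the sketch short.
    """
    return [
        sum(w * fmap[i] for w, fmap in zip(mix, branch_maps))
        for i in range(len(branch_maps[0]))
    ]


DILATIONS = (1, 2, 4)                      # expansion factors from the text
signal = [0.0] * 4 + [1.0] + [0.0] * 4     # unit impulse, length 9
kernel = [1.0, 1.0, 1.0]                   # placeholder weights (assumed)

# One feature map per branch; all have the same length, so they can be
# stacked ("concatenated") along the channel axis.
branches = [dilated_conv1d(signal, kernel, d) for d in DILATIONS]

# Cross-channel aggregation with placeholder mixing weights (assumed).
fused = fuse_1x1(branches, mix=[1.0, 1.0, 1.0])

if __name__ == "__main__":
    for d, fmap in zip(DILATIONS, branches):
        print(f"dilation={d} span={effective_kernel(3, d)} response={fmap}")
    print("fused:", fused)
```

Under these assumptions, the response of each branch to the impulse spreads further as the expansion factor increases, and the fused map combines the short-range and long-range responses at every position. In a real network the mixing weights would be learned, and the 1 × 1 convolution would typically produce several output channels rather than one.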