With the rapid advances in computer vision, the Transformer has attracted increasing interest in this field. However, its use is limited by its very large storage requirements, heavy reliance on dataset size, and high computational cost. Lightweight networks sit at the other extreme, pursuing compact architectures at the cost of performance. In this paper, we enhance a lightweight architecture, used as the backbone of object detection networks, by combining it with a right-sized Transformer, i.e., a Vision Transformer module. Specifically, we build on GhostNet, a well-known lightweight neural network structure, embed the Vision Transformer module at the end of GhostNet, and apply a slicing design to the input data to reduce the computational burden of the network. The enhanced architecture serves as the backbone of the object detection network, and the well-known YOLOv5 is taken as the baseline. We conduct multi-metric comparison experiments on two medium-scale object detection datasets with large-, medium-, and small-scale networks. Results show that, without relying on ultra-large datasets or pre-trained models, the proposed Transformer-enhanced architecture achieves comparable or even higher mAP with only half of the model size and floating-point computation of the baseline.
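To make the input slicing concrete, the following is a minimal sketch in plain Python. It assumes the slicing is a space-to-depth rearrangement in the style of the Focus layer in YOLOv5, where each 2x2 pixel neighbourhood is stacked along the channel axis; the abstract does not spell out the exact scheme, so this is an illustration rather than the proposed implementation. The function name `slice_input` and the nested-list image layout are hypothetical.

```python
def slice_input(image):
    """Space-to-depth slicing sketch (assumed, Focus-style).

    image: list of H rows, each a list of W pixels, each pixel a list of
    C channel values. H and W must be even.
    Returns an (H/2) x (W/2) grid whose pixels have 4*C channels.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")

    sliced = []
    for i in range(0, height, 2):
        row = []
        for j in range(0, width, 2):
            # Concatenate the four pixels of the 2x2 block along channels:
            # (even row, even col), (odd row, even col),
            # (even row, odd col), (odd row, odd col).
            row.append(
                image[i][j]
                + image[i + 1][j]
                + image[i][j + 1]
                + image[i + 1][j + 1]
            )
        sliced.append(row)
    return sliced


# Hypothetical 4x4 single-channel input with pixel value i * 4 + j.
example = [[[i * 4 + j] for j in range(4)] for i in range(4)]
result = slice_input(example)
```

Under this assumption, the spatial resolution is halved in each dimension while the channel count is multiplied by four, so no pixel values are discarded and the subsequent layers operate on a quarter as many spatial positions.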