Transformer models are evolving rapidly in standard natural language processing tasks; however, their application is drastically proliferating in computer vision (CV) as well. Transformers are either replacing convolution networks or being used in conjunction with them. This paper aims to differentiate the design of convolutional neural networks (CNNs) built models and models based on transformer, particularly in the domain of object detection. CNNs are designed to capture local spatial patterns through convolutional layers, which is well suited for tasks that involve understanding visual hierarchies and features. However, transformers bring a new paradigm to CV by leveraging self-attention mechanisms, which allows to capture both local and global context in images. Here, we target the various aspects such as basic level of understanding, comparative study, application of attention model, and highlighting tremendous growth along with delivering efficiency are presented effectively for object detection task. The main emphasis of this work is to offer basic understanding of architectures for object detection task and motivates to adopt the same in computer vision tasks. In addition, this paper highlights the evolution of transformer-based models in object detection and their growing importance in the field of computer vision, we also identified the open research direction in the same field.