“…The main reason for the deficient localization in SOD is the limited information available in the input image or video frame, compounded by the spatial degradation that features undergo as they pass through multiple layers of a deep network. Since small objects appear frequently in many application domains, such as pedestrian detection [3], medical image analysis [4], face recognition [5], traffic sign detection [6], traffic light detection [7], ship detection [8], and Synthetic Aperture Radar (SAR)-based object detection [9], it is worth examining the performance of modern deep learning SOD techniques. In this paper, we compare transformer-based detectors with Convolutional Neural Network (CNN)-based detectors in …

Aref Miri Rekavandi and Mohammed Bennamoun are with the Department of Computer Science and Software Engineering, The University of Western Australia (emails: aref.mirirekavandi@uwa.edu.au, mohammed.bennamoun@uwa.edu.au).…”