2022
DOI: 10.48550/arXiv.2211.03594
Preprint

Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

Abstract: All the results are achieved with test-time augmentation. In the table, we follow the notations for various datasets used in DINO [28] and FocalNet [26]. 'w/ Mask' means using mask annotations when finetuning the detectors on COCO [13].

Cited by 5 publications (5 citation statements, published 2022–2023); references 19 publications.
“…In a traditional ViT model, the processing stages are applied to a sequence of fixed, non-overlapping image patches, and a self-attention mechanism is used to encode context. ViT and its variants have demonstrated wide applicability in object detection [4,36], segmentation [5,37], and video understanding [6]. The success of vision transformers is attributed to global context modeling [33] by pretraining on large-scale data, as opposed to relying on image-specific inductive biases, e.g., translation equivariance employed by traditional CNNs.…”
Section: Vision Transformers for Domain Adaptation (mentioning)
confidence: 99%
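A minimal PyTorch sketch of the pipeline this quote describes may help: the image is split into fixed, non-overlapping patches via a strided convolution, the patch tokens are embedded, and multi-head self-attention lets every patch attend to every other, i.e. the global context modeling the quote refers to. This is illustrative only, not code from the cited paper or from Group DETR v2; the class name TinyViTBlock and all sizes (224-pixel input, 16-pixel patches, dim=192) are assumptions.

```python
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    """Illustrative sketch, not the cited paper's code."""
    def __init__(self, img_size=224, patch_size=16, dim=192, heads=3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Non-overlapping patchify: a conv whose stride equals its kernel size.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim) patch sequence
        tokens = tokens + self.pos_embed
        # Self-attention over the full patch sequence encodes global context.
        t = self.norm(tokens)
        out, _ = self.attn(t, t, t)
        return tokens + out                         # residual connection

tokens = TinyViTBlock()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 192])
```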
“…In recent years, the emergence of attention-based transformer models [1-3] has stimulated interest in new architectures that have achieved state-of-the-art results on a wide variety of computer vision tasks [4-6]. Along with these exciting developments and the growing interest in deploying models in practice, there is a need to investigate the robustness of attention-based models when deployed to new settings.…”
Section: Introduction (mentioning)
confidence: 99%
“…Recently, transformer-based object detection models also showed state-of-the-art accuracies on public datasets [21-24]. They used transformer encoder-decoder structures for object detection to predict a set of boxes that were bipartitely matched to the ground-truth object boxes.…”
Section: Related Work (mentioning)
confidence: 99%
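The set-prediction step this quote describes can be sketched with the Hungarian algorithm, which DETR-style detectors use for bipartite matching between predicted and ground-truth boxes. This is a simplified illustration, not the paper's implementation: the cost here is a plain L1 box distance, whereas real detectors combine classification, L1, and generalized-IoU costs; the function name bipartite_match and the box shapes are assumed for the example.

```python
import torch
from scipy.optimize import linear_sum_assignment

def bipartite_match(pred_boxes, gt_boxes):
    """pred_boxes: (num_queries, 4), gt_boxes: (num_gt, 4), boxes as (cx, cy, w, h).

    Returns a one-to-one assignment of predictions to ground truths that
    minimizes the total matching cost (Hungarian algorithm)."""
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)      # (num_queries, num_gt) L1 costs
    pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
    return pred_idx, gt_idx

preds = torch.rand(100, 4)  # e.g. 100 object queries, most left unmatched
gts = torch.rand(3, 4)      # 3 annotated objects
print(bipartite_match(preds, gts))
```

Unmatched queries (here 97 of the 100) are treated as "no object" during training, which is what lets these detectors predict a set of boxes without post-hoc duplicate removal.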
“…Convolutional neural networks (CNNs) typically form the backbone of modern object detection models. Recently, large-scale vision transformer models [19-21] have yielded impressive results but require training on massive datasets.…”
Section: Related Work (mentioning)
confidence: 99%