2022
DOI: 10.48550/arXiv.2211.03594
Preprint

Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

Abstract: All the results are achieved with test-time augmentation. In the table, we follow the notations for various datasets used in DINO [28] and FocalNet [26]. 'w/ Mask' means using mask annotations when finetuning the detectors on COCO [13].

Cited by 5 publications (5 citation statements, published 2022–2023); references 19 publications.
“…In a traditional ViT model, the processing stages are applied to a sequence of fixed, non-overlapping image patches, and a self-attention mechanism is used to encode context. ViT and its variants have demonstrated wide applicability in object detection [4,36], segmentation [5,37], and video understanding [6]. The success of vision transformers is attributed to global context modeling [33] by pretraining on large-scale data, as opposed to relying on image-specific inductive biases, e.g., translation equivariance employed by traditional CNNs.…”
Section: Vision Transformers for Domain Adaptation (mentioning)
confidence: 99%
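A minimal PyTorch sketch of the pipeline this quote describes may help: the image is split into fixed, non-overlapping patches via a strided convolution, the patch tokens are embedded, and multi-head self-attention lets every patch attend to every other, i.e. the global context modeling the quote refers to. This is illustrative only, not code from the cited paper or from Group DETR v2; the class name TinyViTBlock and all sizes (224-pixel input, 16-pixel patches, dim=192) are assumptions.

```python
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    """Illustrative sketch, not the cited paper's code."""
    def __init__(self, img_size=224, patch_size=16, dim=192, heads=3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Non-overlapping patchify: a conv whose stride equals its kernel size.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim) patch sequence
        tokens = tokens + self.pos_embed
        # Self-attention over the full patch sequence encodes global context.
        t = self.norm(tokens)
        out, _ = self.attn(t, t, t)
        return tokens + out                         # residual connection

tokens = TinyViTBlock()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 192])
```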
“…In recent years, the emergence of attention-based transformer models [1-3] has stimulated interest in new architectures that have achieved state-of-the-art results on a wide variety of computer vision tasks [4-6]. Along with these exciting developments and the growing interest in deploying models in practice, there is a need to investigate the robustness of attention-based models when deployed to new settings.…”
Section: Introduction (mentioning)
confidence: 99%
“…Recently, transformer-based object detection models also showed state-of-the-art accuracies on public datasets [21-24]. They used transformer encoder-decoder structures for object detection to predict a set of boxes that were bipartitely matched to the ground-truth object boxes.…”
Section: Related Work (mentioning)
confidence: 99%
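The set-prediction step this quote describes can be sketched with the Hungarian algorithm, which DETR-style detectors use for bipartite matching between predicted and ground-truth boxes. This is a simplified illustration, not the paper's implementation: the cost here is a plain L1 box distance, whereas real detectors combine classification, L1, and generalized-IoU costs; the function name bipartite_match and the box shapes are assumed for the example.

```python
import torch
from scipy.optimize import linear_sum_assignment

def bipartite_match(pred_boxes, gt_boxes):
    """pred_boxes: (num_queries, 4), gt_boxes: (num_gt, 4), boxes as (cx, cy, w, h).

    Returns a one-to-one assignment of predictions to ground truths that
    minimizes the total matching cost (Hungarian algorithm)."""
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)      # (num_queries, num_gt) L1 costs
    pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
    return pred_idx, gt_idx

preds = torch.rand(100, 4)  # e.g. 100 object queries, most left unmatched
gts = torch.rand(3, 4)      # 3 annotated objects
print(bipartite_match(preds, gts))
```

Unmatched queries (here 97 of the 100) are treated as "no object" during training, which is what lets these detectors predict a set of boxes without post-hoc duplicate removal.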
“…Convolutional neural networks (CNNs) typically form the backbone of modern object detection models. Recently, large-scale vision transformer models [19-21] have yielded impressive results but require training on massive datasets.…”
Section: Related Work (mentioning)
confidence: 99%