2022
DOI: 10.1007/978-3-031-20059-5_29

FindIt: Generalized Localization with Natural Language Queries

Cited by 21 publications (30 citation statements). References 72 publications.
“…Bansal et al. [4] introduce ZS+OV detection, where the classification layer of a closed-vocabulary detector is replaced with the text embeddings of the class names, an approach taken by many subsequent works [11,14,16,24,31,42,46], including this one. Some works [16,24,42] take the OV classification closer to the backbone features by directly extracting them from object proposals with ROI-Align [20], and optionally distill a strong OV classifier into the detector [16]. To improve ZS performance, Detic [46] and PromptDet [14] forego the OV aspect - knowing the names of the classes of interest (i.e.…”
Section: Related Work
confidence: 99%
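A minimal sketch of the open-vocabulary classification idea this statement describes: the fixed classification layer is replaced by cosine similarity against text embeddings of the class names, so the vocabulary can change without retraining. Function and variable names here (`open_vocab_logits`, `region_features`) are illustrative, not from any cited codebase.

```python
import torch
import torch.nn.functional as F

def open_vocab_logits(region_features: torch.Tensor,
                      class_name_embeddings: torch.Tensor,
                      temperature: float = 0.01) -> torch.Tensor:
    """Score each region against an arbitrary set of class-name embeddings.

    region_features:       (num_regions, dim) pooled per-region features,
                           e.g. extracted from proposals with ROI-Align.
    class_name_embeddings: (num_classes, dim) text embeddings of the class
                           names from a vision-language model such as CLIP.
    """
    regions = F.normalize(region_features, dim=-1)
    classes = F.normalize(class_name_embeddings, dim=-1)
    # Cosine-similarity logits: swapping in a new embedding matrix changes
    # the detector's vocabulary without touching its learned weights.
    return regions @ classes.t() / temperature

# Example: 5 region proposals scored against 3 arbitrary class names.
logits = open_vocab_logits(torch.randn(5, 512), torch.randn(3, 512))
print(logits.shape)  # torch.Size([5, 3])
```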
“…We use the LVIS v1.0 [18] object detection benchmark adapted for zero-shot evaluation; we call this setup LVIS-R. Following standard practice [16,24,46], the rare class annotations are removed from the training set, keeping only the frequent and common annotations (often called LVIS-base). Evaluation is then performed on all classes, reporting the box mAP for all classes (mAP_all) and the mAP on rare classes (mAP_rare), with the emphasis on mAP_rare as this measures the zero-shot performance, rare classes playing the role of the unseen classes.…”
Section: Baseline Detector and Experimental Setup
confidence: 99%
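A sketch of the LVIS-base construction this statement describes: rare-class annotations are dropped from the training set, while evaluation later runs on all classes. The field names follow the LVIS v1.0 JSON annotation layout, whose categories carry a `frequency` band of `"r"`, `"c"`, or `"f"`; the file paths are placeholders.

```python
import json

def make_lvis_base(lvis_train_json: str, out_json: str) -> None:
    """Write a copy of the LVIS training annotations without rare classes."""
    with open(lvis_train_json) as f:
        data = json.load(f)

    # Keep only 'common' and 'frequent' categories; 'rare' classes play
    # the role of the unseen classes at evaluation time.
    kept_ids = {c["id"] for c in data["categories"]
                if c["frequency"] in ("c", "f")}
    data["annotations"] = [a for a in data["annotations"]
                           if a["category_id"] in kept_ids]

    with open(out_json, "w") as f:
        json.dump(data, f)
```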
“…Our approach combines visual and language modalities to improve object anticipation. Most recent multimodal architectures are based on Transformer fusion schemes applied to features extracted by various encoders [5,12,27]. We follow this spirit by employing self-attention-based modality fusion.…”
Section: Related Work
confidence: 99%
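A minimal sketch of self-attention-based modality fusion as mentioned in this statement: visual and language token sequences are concatenated and passed through a standard Transformer encoder layer, so attention runs across both modalities. Dimensions and layer settings are illustrative.

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor,
                text_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate (B, Nv, D) and (B, Nt, D) into one (B, Nv+Nt, D)
        # sequence so self-attention mixes the two modalities.
        fused = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.encoder(fused)

# Example: 49 visual tokens fused with 12 text tokens.
fusion = SelfAttentionFusion()
out = fusion(torch.randn(2, 49, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 61, 256])
```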
“…FIBER (Dou et al., 2022a) improves GLIP by (i) using a coarse-to-fine pretraining pipeline, and (ii) performing fusion in the backbone rather than in the OD head as in GLIP. … (Zang et al., 2022), X-DETR (Cai et al., 2022), FindIt (Kuo et al., 2022), PromptDet (Feng et al., 2022), and OWL-ViT (Minderer et al., 2022).…”
Section: Two-stage Models
confidence: 99%
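An illustrative contrast between the two fusion placements this statement distinguishes: cross-modal fusion in the detection head (late, as described for GLIP) versus fusion interleaved with backbone stages (early, as described for FIBER). The components here are generic stand-ins, not the actual GLIP or FIBER modules.

```python
import torch.nn as nn

class LateFusionDetector(nn.Module):
    """Backbone runs vision-only; cross-modal fusion happens in the head."""
    def __init__(self, backbone: nn.Module, fusion: nn.Module,
                 head: nn.Module):
        super().__init__()
        self.backbone, self.fusion, self.head = backbone, fusion, head

    def forward(self, image, text_tokens):
        feats = self.backbone(image)            # no language signal yet
        return self.head(self.fusion(feats, text_tokens))

class BackboneFusionDetector(nn.Module):
    """Fusion layers are interleaved with the backbone stages instead."""
    def __init__(self, stages, fusions, head: nn.Module):
        super().__init__()
        self.stages = nn.ModuleList(stages)
        self.fusions = nn.ModuleList(fusions)
        self.head = head

    def forward(self, image, text_tokens):
        feats = image
        for stage, fuse in zip(self.stages, self.fusions):
            # Language conditions the visual features at every stage.
            feats = fuse(stage(feats), text_tokens)
        return self.head(feats)
```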