2020
DOI: 10.1007/978-3-030-66096-3_1
Commands 4 Autonomous Vehicles (C4AV) Workshop Summary

Abstract: The task of visual grounding requires locating the most relevant region or object in an image, given a natural language query. So far, progress on this task was mostly measured on curated datasets, which are not always representative of human spoken language. In this work, we deviate from recent, popular task settings and consider the problem under an autonomous vehicle scenario. In particular, we consider a situation where passengers can give free-form natural language commands to a vehicle which can be assoc…

Cited by 6 publications (8 citation statements). References 41 publications.
“…We project the top-32 confidence-scoring predictions to the frontal view and evaluate the rate at which at least one sample achieves an IoU ≥ 0.5 with the ground-truth referred object in the Talk2Car test set. We find that such a bounding box exists in 92% of the cases, which is the same as with the bounding boxes of Deruyttere et al. (2020); therefore, no substantial increase in IoU is expected solely from improved bounding box predictions. The average distance between the 3D bounding box of the ground-truth referred object and the 3D predicted bounding box is 1.7 m on the Talk2Car test set.…”
Section: 3D Object Detector (supporting)
confidence: 56%
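
The recall check described in the statement above (whether at least one of the top-32 highest-scoring boxes reaches IoU ≥ 0.5 with the ground-truth referred object) is easy to write down. Below is a minimal Python sketch; the `predictions` and `ground_truths` structures are illustrative assumptions, not the cited paper's code.

```python
# Minimal sketch of the top-k IoU recall metric described above.
# Assumed layout: each prediction is {"box": (x1, y1, x2, y2), "score": float},
# and each ground truth is an (x1, y1, x2, y2) box.

def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def topk_recall(predictions, ground_truths, k=32, thresh=0.5):
    """Fraction of samples where any of the k highest-scoring boxes
    reaches IoU >= thresh with the ground-truth box."""
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        top = sorted(preds, key=lambda p: p["score"], reverse=True)[:k]
        if any(iou(p["box"], gt) >= thresh for p in top):
            hits += 1
    return hits / len(ground_truths)
```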
“…An extra experiment in which we heavily undersample the training set further supports this point and reveals that ten samples per category suffice to attain performance close to that obtained with the full data set. Overall, the low computational cost and high predictive power of principled canonical representations support their potential use as a compact scene descriptor for applications that require fast real-time scene understanding, such as self-driving cars [12].…”
Section: Volume X, 2021 (mentioning)
confidence: 98%
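
The undersampling setup mentioned in this statement (keeping only ten samples per category) can be sketched as follows. This is an assumed reconstruction of the experimental setup, not the authors' actual code:

```python
# Hypothetical sketch: reduce a labeled dataset to at most n samples
# per category, as in the undersampling experiment described above.
import random
from collections import defaultdict

def undersample(dataset, n=10, seed=0):
    """dataset: iterable of (sample, category) pairs.
    Returns a subset with at most n randomly chosen samples per category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for sample, category in dataset:
        by_category[category].append(sample)
    subset = []
    for category, samples in by_category.items():
        rng.shuffle(samples)
        subset.extend((s, category) for s in samples[:n])
    return subset
```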
“…Only the image is provided, and no information about the object classes is given. One must … A few images in HICO-DET have more than one O class; the majority of images have a unique object class. There are often multiple instances of the same object class (e.g., bikes) and/or multiple triplet classes t with the same O involved (e.g., (human, repairing, bike) and (human, holding, bike)).…”
Section: Layout Transformations (mentioning)
confidence: 99%
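
For context, the statement above refers to HICO-DET's (human, predicate, object) triplet annotations. The representation below is purely illustrative; the field names are assumptions, not the dataset's actual schema:

```python
# Illustrative HOI-triplet structure: one image can carry several triplets
# that share the same object class, or even the same object instance.
from dataclasses import dataclass

@dataclass
class HOITriplet:
    human_box: tuple    # (x1, y1, x2, y2)
    object_box: tuple   # (x1, y1, x2, y2)
    predicate: str      # e.g. "repairing"
    object_class: str   # e.g. "bike"

# The same human and the same bike can appear in two triplets,
# matching the (repairing, holding) example in the quote:
annotations = [
    HOITriplet((10, 20, 80, 200), (60, 120, 180, 220), "repairing", "bike"),
    HOITriplet((10, 20, 80, 200), (60, 120, 180, 220), "holding", "bike"),
]
```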
“…When passengers give a command to a self-driving car that refers to a specific object, a first step in understanding the command is to detect this object (Deruyttere et al., 2020b). This task is often called visual grounding (VG) in the literature.…”
Section: Introduction (mentioning)
confidence: 99%
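
Visual grounding as described here is often framed as ranking candidate regions against the command. The sketch below shows that retrieval view under the assumption of pre-computed embeddings; the encoders producing them are placeholders, not the cited work's model:

```python
# Minimal sketch of visual grounding as retrieval: pick the region proposal
# whose embedding is most similar (cosine similarity) to the command embedding.
import numpy as np

def ground_command(command_vec, region_vecs):
    """command_vec: (d,) text embedding of the natural language command.
    region_vecs: (n, d) embeddings of n region proposals.
    Returns the index of the best-matching region."""
    c = command_vec / np.linalg.norm(command_vec)
    r = region_vecs / np.linalg.norm(region_vecs, axis=1, keepdims=True)
    return int(np.argmax(r @ c))
```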