2018 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv.2018.00206

Object Referring in Visual Scene with Spoken Language

Abstract: Object referring has important applications, especially for human-machine interaction. While having received great attention, the task is mainly attacked with written language (text) as input rather than spoken language (speech), which is more natural. This paper investigates Object Referring with Spoken Language (ORSpoken) by presenting two datasets and one novel approach. Objects are annotated with their locations in images, text descriptions and speech descriptions. This makes the datasets ideal for multi-m…

Cited by 15 publications (16 citation statements)
References 43 publications

“…Using speech as an input modality has a long history (Bolt 1980) and is recently emerging as a research direction in Computer Vision (Dai 2016; Vasudevan et al. 2017; Vaidyanathan et al. 2018; Harwath et al. 2018). To the best of our knowledge, however, our paper is the first to show that speech allows for more efficient object class labelling than (Lin et al. 2014; Deng et al. 2014) and enables simultaneous class and box labelling.…”
Section: Related Work
confidence: 80%
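The claim above about simultaneous class and box labelling is easy to picture with a short sketch. In the fragment below, one drawn box plus one spoken utterance produce a complete annotation; the transcribe() helper is a hypothetical stand-in for any off-the-shelf ASR system, and Annotation is an illustrative structure, not an interface from the cited papers.

```python
# Hedged sketch: speech-driven class-and-box labelling.
# transcribe() is a placeholder assumption, not a real ASR API.
from dataclasses import dataclass

@dataclass
class Annotation:
    box: tuple   # (x1, y1, x2, y2) in pixels
    label: str   # class name recovered from the spoken utterance

def transcribe(audio_clip: bytes) -> str:
    """Stand-in for any off-the-shelf speech recogniser; returns a
    canned transcript so the sketch runs without an audio backend."""
    return "traffic light"

def label_with_speech(box: tuple, audio_clip: bytes) -> Annotation:
    # One gesture plus one utterance yields both parts of the annotation,
    # instead of drawing the box and then typing the class name.
    return Annotation(box=box, label=transcribe(audio_clip).strip().lower())

print(label_with_speech((12, 34, 160, 220), b""))
```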
“…For instance, [23] uses EdgeBox [59] for the object proposals; [34] and [25] use the Faster RCNN (FRCNN) object detector [43], Mask-RCNN [18], Language-based Object Proposals (LOP) [52] and others to propose the candidates. [52] shows that LOP performs significantly better than other techniques when proposing expression-aware object candidates. For the same reason, we use LOP [52] for the object proposals.…”
Section: Object Proposals
confidence: 99%
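To make the proposal step concrete, here is a minimal sketch of scoring candidate boxes against a referring expression, assuming precomputed box features and an expression embedding. The cosine-similarity ranking and all function names are illustrative assumptions; LOP [52] itself is a learned, expression-aware proposal model, not this post-hoc re-ranking.

```python
# Hedged sketch: rank candidate boxes by similarity to an expression embedding.
import numpy as np

def rank_proposals(box_feats: np.ndarray, expr_feat: np.ndarray, top_k: int = 10):
    """Return indices and scores of the top_k boxes by cosine similarity.

    box_feats : (N, D) visual feature per candidate box
    expr_feat : (D,)   embedding of the referring expression
    """
    # L2-normalise both sides so the dot product is a cosine similarity.
    b = box_feats / (np.linalg.norm(box_feats, axis=1, keepdims=True) + 1e-8)
    e = expr_feat / (np.linalg.norm(expr_feat) + 1e-8)
    scores = b @ e                      # (N,) similarity per candidate
    order = np.argsort(-scores)[:top_k] # best-first
    return order, scores[order]

# Usage with random stand-in features:
rng = np.random.default_rng(0)
idx, sc = rank_proposals(rng.normal(size=(100, 256)), rng.normal(size=256), top_k=5)
print(idx, sc)
```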
“…In Computer Vision, Vasudevan et al. [32] detect objects given spoken referring expressions, while Harwath et al. [9] learn an embedding from spoken image-caption pairs. Their approach obtains promising first results, but still performs worse than learning on top of textual captions obtained from Google's automatic speech recognition.…”
Section: Related Work
confidence: 99%
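As a rough illustration of the cross-modal embedding idea attributed to Harwath et al. [9], the sketch below projects image features and pooled spectrogram encodings into a shared space and trains them with an in-batch triplet loss. The layer shapes, pooling, and margin are simplified assumptions, not the authors' actual architecture.

```python
# Hedged sketch: joint embedding of images and spoken captions (triplet loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, audio_dim=40, emb_dim=512):
        super().__init__()
        # Project precomputed CNN image features into the shared space.
        self.img_proj = nn.Linear(img_dim, emb_dim)
        # Encode a spectrogram (T, audio_dim) with a small 1-D conv stack.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(audio_dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, emb_dim, kernel_size=5, padding=2),
        )

    def forward(self, img_feats, spectrograms):
        img = F.normalize(self.img_proj(img_feats), dim=1)
        # (B, T, audio_dim) -> (B, audio_dim, T) for Conv1d, then pool over time.
        a = self.audio_enc(spectrograms.transpose(1, 2)).mean(dim=2)
        return img, F.normalize(a, dim=1)

def triplet_loss(img, audio, margin=0.2):
    """Hinge loss over in-batch negatives: matched pairs sit on the diagonal."""
    sim = img @ audio.t()                       # (B, B) pairwise similarities
    pos = sim.diag().unsqueeze(1)               # matched-pair score per row
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool)
    return F.relu(margin + sim - pos)[neg_mask].mean()

# Usage with random stand-in tensors:
model = JointEmbedding()
i, a = model(torch.randn(8, 2048), torch.randn(8, 200, 40))
print(triplet_loss(i, a).item())
```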