Interspeech 2019
DOI: 10.21437/interspeech.2019-2840
A Study for Improving Device-Directed Speech Detection Toward Frictionless Human-Machine Interaction

Cited by 21 publications (35 citation statements)
References 4 publications
“…Their model achieves an EER (equal error rate) of 5.2% on their proprietary dataset. A study to improve device-directed utterance classification [6] presents performance for different dialog types and decoder features. Learning when to listen [7] proposes using prosodic features along with lexical features for better detection of such utterances; they report an EER of 6.72% with their best model.…”
Section: Related Work
Confidence: 99%
“…Utterance-level acoustic embeddings computed using LSTMs have previously been used in [4,5,6], where they were combined with ASR decoding features for device-directed audio detection. Our approach, depicted in Figure 1, differs from prior literature in multiple ways.…”
Section: Introduction
Confidence: 99%
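The pipeline quoted above — an LSTM reducing frame-level acoustic features to a single utterance embedding, concatenated with ASR decoder features before classification — can be sketched minimally in numpy. All weights, dimensions, and the decoder features below are random or hypothetical stand-ins, not the cited systems' parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_embed(frames, W, U, b):
    """Run a single-layer LSTM over frame-level acoustic features
    and return the final hidden state as the utterance embedding."""
    d_h = b.shape[0] // 4
    h = np.zeros(d_h)
    c = np.zeros(d_h)
    for x in frames:
        z = W @ x + U @ h + b              # all four gates in one affine map
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)         # update cell state
        h = o * np.tanh(c)                 # update hidden state
    return h

# toy dimensions: 40-dim frame features, 8-dim utterance embedding
T, d_in, d_h = 50, 40, 8
frames = rng.standard_normal((T, d_in))
W = 0.1 * rng.standard_normal((4 * d_h, d_in))   # untrained stand-in weights
U = 0.1 * rng.standard_normal((4 * d_h, d_h))
b = np.zeros(4 * d_h)
emb = lstm_embed(frames, W, U, b)

# hypothetical ASR decoder features (e.g. a confidence score, an LM cost)
decoder_feats = np.array([0.93, -2.1])

# concatenate embedding and decoder features, then score with a logistic layer
combined = np.concatenate([emb, decoder_feats])
w_clf = 0.1 * rng.standard_normal(combined.shape[0])
p_device_directed = sigmoid(w_clf @ combined)
```

In a trained system the classifier would be learned jointly with (or on top of) the LSTM; here the point is only the shape of the combination: sequence → fixed embedding → concatenation with decoder features → score.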
“…Approaches for the classification of utterances into system- and non-system-directed ones typically use acoustic features extracted from the speech signal, e.g., [1,2,3,4,5]. Previous works [1,6,7] also show that using an attention mechanism combined with a BiLSTM network can improve classification performance. (*Author performed the research herein as part of an internship-partnership program between Mila and Nuance.)…”
Section: Introduction
Confidence: 99%
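The attention mechanism mentioned in this excerpt typically pools the BiLSTM's per-frame outputs into one vector via a learned softmax weighting. A minimal numpy sketch, with random stand-ins for the BiLSTM outputs and the attention vector:

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_pool(H, w):
    """Softmax-weighted pooling of per-frame vectors H (T x d):
    scores = H @ w, alphas = softmax(scores), context = alphas @ H."""
    scores = H @ w
    scores = scores - scores.max()     # subtract max for numerical stability
    alphas = np.exp(scores)
    alphas = alphas / alphas.sum()     # normalized attention weights
    return alphas @ H, alphas

# stand-in for BiLSTM outputs: T frames of d-dim hidden vectors
T, d = 30, 16
H = rng.standard_normal((T, d))
w = rng.standard_normal(d)             # learned attention vector (random here)

context, alphas = attention_pool(H, w)
# context feeds the classifier head; alphas show which frames mattered
```

Compared with taking only the final hidden state, this lets the classifier weight the frames most indicative of device-directed speech.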
“…In addition to using conventional acoustic features, e.g., MFCCs or log filter-bank energies, some works also incorporate other acoustic and non-acoustic features as more representative cues of system-directed speech. Such works include [4], where prosodic features are used, and [2,4,6], where lexical and semantic features are derived from ASR decoders.…”
Section: Introduction
Confidence: 99%
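As a reference point for the "log filter-bank energies" named above, here is a compact numpy front end: frame and window the signal, take the power spectrum, apply a triangular mel filter bank, and take logs. The parameter values (16 kHz audio, 25 ms frames, 10 ms hop, 23 filters) are common illustrative defaults, not the cited papers' configurations.

```python
import numpy as np

def log_fbank(signal, sr=16000, n_fft=512, n_mels=23,
              frame_len=400, hop=160):
    """Log mel filter-bank energies for a mono signal."""
    # frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filter bank between 0 and Nyquist
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    energies = power @ fb.T
    return np.log(np.maximum(energies, 1e-10))   # floor to avoid log(0)

sr = 16000
t = np.arange(sr) / sr                 # 1 second of a 440 Hz tone
sig = np.sin(2 * np.pi * 440.0 * t)
feats = log_fbank(sig)                 # shape: (n_frames, n_mels)
```

Taking the DCT of these log energies would yield MFCCs, the other conventional feature the excerpt mentions.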