Interspeech 2019
DOI: 10.21437/interspeech.2019-2270

A Neural Turn-Taking Model without RNN

Abstract: Sequential data such as speech and dialogs are usually modeled by Recurrent Neural Networks (RNNs) and their derivatives, since such architectures let information travel through time. However, RNNs have disadvantages, including the limited depth of the networks and a GPU-unfriendly training process. Estimating the timing of turn-taking is a critical feature of dialog systems. Such tasks require knowledge about past dialog contexts and have been modeled using RNNs in several studies. In …

Citations: cited by 2 publications (1 citation statement)
References: 19 publications (32 reference statements)

“…However, this approach is limited in its naturalness that the fixed threshold can potentially be too short (frequent interruptions) or too long (awkward pauses). To address this problem, machine learning methods have gained popularity since 1970s [3, 4, inter alia], and models based on inter-pausal unit (IPU), an audio segment followed by silence longer than 200 milliseconds, have mostly been studied recently because of its simplicity in practice [5,6]. For a specific IPU, various cues across modalities, such as prosody, semantics, syntax, breathing, gesture, and eye-gaze can be extracted and integrated to determine whether this turn is yielded or not [7,8].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
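
The inter-pausal unit (IPU) definition quoted above, an audio segment followed by silence longer than 200 milliseconds, can be illustrated with a minimal sketch. It assumes frame-level voice-activity decisions are already available as a list of booleans; the function name `segment_ipus`, the 10 ms frame length, and the choice to flush a trailing segment at the end of the audio are illustrative assumptions, not taken from the cited papers.

```python
from typing import List, Tuple


def segment_ipus(
    voiced: List[bool],
    frame_ms: int = 10,          # assumed frame length; not from the source
    min_silence_ms: int = 200,   # silence threshold from the quoted IPU definition
) -> List[Tuple[int, int]]:
    """Group voiced frames into IPUs as (start_frame, end_frame_exclusive) spans.

    An IPU is closed once the silence following its last voiced frame
    becomes strictly longer than min_silence_ms.
    """
    ipus: List[Tuple[int, int]] = []
    start = None        # first frame of the IPU currently being built
    last_voiced = None  # index of the most recent voiced frame
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and (i - last_voiced) * frame_ms > min_silence_ms:
            # Silence has exceeded the threshold: close the current IPU.
            ipus.append((start, last_voiced + 1))
            start = None
    if start is not None:
        # Assumption: a segment still open when the audio ends is kept as an IPU.
        ipus.append((start, last_voiced + 1))
    return ipus


# Hypothetical voice-activity sequence: a 100 ms pause stays inside one IPU,
# while a 250 ms pause separates two IPUs.
frames = [True] * 5 + [False] * 10 + [True] * 5 + [False] * 25 + [True] * 3
print(segment_ipus(frames))
```

A turn-taking model of the kind described would then decide, for each such unit, whether the turn is yielded; that classification step is outside this sketch.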