Proceedings of the 20th ACM International Conference on Multimodal Interaction 2018
DOI: 10.1145/3242969.3242994

Evaluation of Real-time Deep Learning Turn-taking Models for Multiple Dialogue Scenarios

Abstract: The task of identifying when to take a conversational turn is an important function of spoken dialogue systems. The turn-taking system should also ideally be able to handle many types of dialogue, from structured conversation to spontaneous and unstructured discourse. Our goal is to determine how much a generalized model trained on many types of dialogue scenarios would improve on a model trained only for a specific scenario. To achieve this goal we created a large corpus of Wizard-of-Oz conversation data whic…

Cited by 27 publications (25 citation statements)
References 18 publications

“…We experimentally evaluated the proposed model on a Japanese conversation corpus (identical as that previously used [10]) that consists of four types of conversations and over 30,000 utterances. The type and the number of the sessions are shown in Table 1.…”
Section: Discussion (mentioning)
confidence: 99%
“…Such hard-coded models are difficult to transfer to other languages/cultures since they are culture-dependent. Data-driven methods such as finite state machine-based [7] and neural network-based models [8,9,10] have also been proposed in recent years. These works use feature sequences extracted from both text and speech signals.…”
Section: Introduction (mentioning)
confidence: 99%
“…This model detects TRP at the end of IPU, based on prosodic and linguistic information of the preceding utterance. We used a hierarchical model of LSTM where each kind of feature is modeled by an individual LSTM and the outputs of those LSTMs are concatenated and fed into a linear layer that outputs the posterior probability of the output label [29], as shown in Figure 2. The reference labels are binary corresponding to the TRP labels annotated in Section 3.…”
Section: TRP Detection (mentioning)
confidence: 99%
“…The prediction model was based on conditional random field [16], support vector machines [24], and neural networks [25]. A recent approach is to use recurrent neural networks such as long short-term memory (LSTM), which can handle long-range context of the input sequence, and it achieved higher accuracy than conventional methods [15,26,19,27,28,29,30]. However, the performance is still low in natural conversations.…”
Section: Introduction (mentioning)
confidence: 99%