Interspeech 2018
DOI: 10.21437/interspeech.2018-1442

Prediction of Turn-taking Using Multitask Learning with Prediction of Backchannels and Fillers

Abstract: We address prediction of turn-taking considering related behaviors such as backchannels and fillers. Backchannels are used by the listeners to acknowledge that the current speaker can hold the turn. On the other hand, fillers are used by the prospective speakers to indicate a will to take a turn. We propose a turn-taking model based on multitask learning in conjunction with prediction of backchannels and fillers. The multitask learning of LSTM neural networks shared by these tasks allows for efficient and gener…
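
The multitask setup described in the abstract can be illustrated with a minimal sketch of a joint objective: one shared representation feeds three task-specific heads (turn-taking, backchannel, filler), and the training loss is a weighted sum of the per-task losses. Everything concrete here is an assumption made for illustration and is not taken from the paper: the fixed vector standing in for the shared LSTM output, the linear-sigmoid heads, the binary cross-entropy loss, and the task weights.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction p and one label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def head(shared, weights, bias):
    """Hypothetical task-specific head: linear layer plus sigmoid."""
    return sigmoid(sum(w * h for w, h in zip(weights, shared)) + bias)


def multitask_loss(shared, heads, labels, task_weights):
    """Weighted sum of per-task losses computed from one shared representation."""
    per_task = {}
    total = 0.0
    for task, (weights, bias) in heads.items():
        p = head(shared, weights, bias)
        per_task[task] = binary_cross_entropy(p, labels[task])
        total += task_weights[task] * per_task[task]
    return total, per_task


# Assumed stand-in for the shared LSTM output at one time step.
shared = [0.2, -0.5, 0.9]

# Assumed head parameters (weights, bias) for each task.
heads = {
    "turn": ([1.0, 0.0, 0.5], 0.0),
    "backchannel": ([-0.3, 0.8, 0.1], 0.1),
    "filler": ([0.4, 0.4, -0.2], -0.1),
}
labels = {"turn": 1, "backchannel": 0, "filler": 1}

# Assumed weighting: the main task counts fully, auxiliary tasks count half.
task_weights = {"turn": 1.0, "backchannel": 0.5, "filler": 0.5}

total, per_task = multitask_loss(shared, heads, labels, task_weights)
print(total, per_task)
```

Because all three heads read the same representation, gradients from the auxiliary tasks would also update the shared layers during training, which is the mechanism multitask learning relies on. Setting the auxiliary weights to zero reduces the objective to the single-task turn-taking loss.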

Cited by 35 publications (32 citation statements)
References 21 publications
“…The two-layer structure enhanced the performance compared to simply combining all impactful features. The accuracy and F1-score of the combined model is comparable with some recent attempts on similar tasks such as the "ERICA" WOZ job interviews [7], which also used a relatively small corpus, and a little better than some recent large-corpora studies using Switchboard data [5,4]. Although such comparisons are of limited significance because of the many factors (discussed in 2) that affect turn-taking behavior and prediction, these results are encouraging given the open-endedness and complexity of our dialogue setting.…”
Section: Discussion (supporting)
confidence: 77%
“…Another study also attempted to predict backchannels and fillers as well as turn-taking using prosody [7]. A general observation about prior studies is that F-scores for turn prediction depend very much on the scope of the dialogues (e.g., map task: 81.7 [17] vs. Switchboard: 65.8 [4]), the size of the training corpus (e.g., 2.5 hours, job interviews: 77.3 [7] vs. 11 hours, MAHNOB: 93.4 [18]), as well as what is being measured and predicted (e.g., use of visual as well as linguistic features, or inclusion/exclusion of backchannels as turns). Also as pointed out in [22], in more difficult tasks pauses may be due to thinking about what to say rather than whether to yield the turn.…”
Section: Literature Review (mentioning)
confidence: 99%
“…They showed that the combination of prosodic and lexical features can lead to promising results. A turn-taking model based on multitask learning was proposed by [28], which also took into account the prediction of backchannels and fillers. An incremental turn-taking model with active system barge-in was proposed by [29], who modeled the turn-taking problem as a Finite State Machine and learned the turn-taking policy by means of reinforcement learning.…”
Section: Related Work (mentioning)
confidence: 99%