Proceedings of the 20th ACM International Conference on Multimodal Interaction 2018
DOI: 10.1145/3242969.3242997

Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs

Abstract: In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions, it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that modeling linguis…
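The abstract describes modality-specific modeling at separate timescales feeding a continuous, frame-level prediction. As a rough illustration of that idea (not the authors' released code; the feature dimensions, frame rates, layer sizes, and the hold-last-value alignment below are assumptions), a multiscale fusion model might look like this:

```python
# Illustrative sketch: two modality-specific LSTMs run at different temporal
# rates and are fused by a master LSTM that emits a frame-level prediction
# of upcoming speech activity.
import torch
import torch.nn as nn

class MultiscaleTurnTakingRNN(nn.Module):
    def __init__(self, acoustic_dim=40, linguistic_dim=300, hidden=64):
        super().__init__()
        # Fast timescale: acoustic features (e.g., one frame every 50 ms).
        self.acoustic_rnn = nn.LSTM(acoustic_dim, hidden, batch_first=True)
        # Slow timescale: linguistic features (e.g., one step per word).
        self.linguistic_rnn = nn.LSTM(linguistic_dim, hidden, batch_first=True)
        # Master RNN fuses both streams at the fast rate.
        self.master_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)  # per-frame speech-activity logit

    def forward(self, acoustic, linguistic, align_idx):
        # acoustic:   (B, T_fast, acoustic_dim)
        # linguistic: (B, T_slow, linguistic_dim)
        # align_idx:  (B, T_fast) LongTensor, index of the most recent
        #             word-level step for each acoustic frame
        fast, _ = self.acoustic_rnn(acoustic)
        slow, _ = self.linguistic_rnn(linguistic)
        # Upsample the slow stream by repeating each word-level state over
        # the acoustic frames it spans (hold-last-value alignment).
        slow_up = torch.gather(
            slow, 1, align_idx.unsqueeze(-1).expand(-1, -1, slow.size(-1)))
        fused, _ = self.master_rnn(torch.cat([fast, slow_up], dim=-1))
        return self.out(fused).squeeze(-1)  # (B, T_fast) logits
```

Frame-level logits of this kind are typically trained against future speech-activity targets with a binary cross-entropy loss; the alignment indices simply map each fast (acoustic-rate) frame to its most recent word-level step.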

Citations: Cited by 23 publications (18 citation statements)
References: References 13 publications

“…With such knowledge, many studies have developed models for predicting actual turn-changing, i.e., whether turn-changing or turn-keeping will take place, on the basis of acoustic features [3, 6, 10, 12, 18, 26, 34, 36–38, 43, 47, 50], linguistic features [34,37,38,43], and visual features, such as overall physical motion [3,6,8,43], near the end of a speaker's utterances or during multiple utterances. Moreover, some research has focused on detailed non-verbal behaviors such as eye-gaze behavior [3,6,18,20,24,26], head movement [18,21,22], mouth movement [23], and respiration [20,25].…”
Section: Related Work 2.1 Turn-Changing Prediction Technology (mentioning)
confidence: 99%
“…We used automatically extracted, high-level abstract features from the acoustic, linguistic, and visual modalities. We plan to use other interpretable features, such as prosody [10,15,16,19,37,38,43] and gaze behavior [3,20,24,26,30], and to implement more complex prediction models [37,38,43,50] that take temporal dependencies into account.…”
Section: Future Work (mentioning)
confidence: 99%
“…Our model is also related to continuous turn-taking systems (Skantze, 2017) in that our model is trained to predict future speech behavior on a frame-by-frame basis. The encoder uses a multiscale RNN architecture similar to the one proposed in Roddy et al. (2018) to fuse information across modalities. Models that intentionally generate responsive overlap have been proposed in DeVault et al. (2011) and Dethlefs et al. (2012).…”
Section: Introduction (mentioning)
confidence: 99%
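This statement highlights frame-by-frame prediction of future speech behavior, in the style of continuous turn-taking systems. A minimal sketch of how such frame-wise targets can be derived from a binary voice-activity track (the 60-frame horizon and the helper name `future_speech_targets` are illustrative assumptions, not values from the cited papers):

```python
# Build frame-by-frame "future speech" targets for continuous turn-taking
# prediction: for each frame, the fraction of the next `horizon` frames in
# which the interlocutor is speaking.
import numpy as np

def future_speech_targets(vad, horizon=60):
    """vad is a 0/1 voice-activity array sampled at a fixed frame rate."""
    vad = np.asarray(vad, dtype=float)
    targets = np.zeros_like(vad)
    for t in range(len(vad)):
        window = vad[t + 1 : t + 1 + horizon]
        targets[t] = window.mean() if len(window) > 0 else 0.0
    return targets

# Example: silence followed by speech -> targets rise before the onset.
print(future_speech_targets([0, 0, 0, 1, 1, 1], horizon=3))
```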
“…Linguistic features, such as syntactic structure, turn-ending markers, and language models, were also investigated [14,15]. Moreover, multimodal features such as eye gaze [16,17,18,19], respiration [20,21,22], and head direction [16,23] were considered. Prediction models have been based on conditional random fields [16], support vector machines [24], and neural networks [25].…”
Section: Introduction (mentioning)
confidence: 99%
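The statement names conditional random fields, support vector machines, and neural networks as classical turn-taking predictors. For concreteness, here is a hedged sketch of the SVM-style setup, classifying turn-change versus turn-keep from hand-picked end-of-utterance features (the feature set and toy data are placeholders, not taken from any cited study):

```python
# Toy turn-change vs. turn-keep classifier over end-of-utterance features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each row: [final pitch slope, final intensity (dB), pause length (s), gaze-at-listener flag]
X = np.array([
    [-0.8, 55.0, 0.60, 1],   # falling pitch, long pause, gaze at listener
    [ 0.2, 62.0, 0.10, 0],   # level pitch, short pause, gaze averted
    [-0.5, 58.0, 0.45, 1],
    [ 0.4, 65.0, 0.05, 0],
])
y = np.array([1, 0, 1, 0])   # 1 = turn-change, 0 = turn-keep

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict([[-0.7, 56.0, 0.50, 1]]))  # expected: turn-change (1)
```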
“…Prediction models have been based on conditional random fields [16], support vector machines [24], and neural networks [25]. A more recent approach is to use recurrent neural networks such as long short-term memory (LSTM) networks, which can handle the long-range context of the input sequence and have achieved higher accuracy than conventional methods [15,26,19,27,28,29,30]. However, performance is still low in natural conversations.…”
Section: Introduction (mentioning)
confidence: 99%
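This statement attributes recent gains to recurrent models such as LSTMs that exploit long-range context. As a generic illustration of that training recipe (the sizes, sequence lengths, and random data are assumptions, not the setup of any specific cited paper), a frame-level LSTM predictor can be trained with binary cross-entropy as follows:

```python
# Generic frame-level LSTM turn-taking predictor trained with BCE loss.
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=40, hidden_size=64, batch_first=True)
head = nn.Linear(64, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

features = torch.randn(8, 600, 40)   # batch of 8 dialogs, 600 frames each
targets = torch.rand(8, 600)         # frame-wise future-speech targets in [0, 1]

for _ in range(5):                    # a few illustrative optimization steps
    hidden_states, _ = rnn(features)
    logits = head(hidden_states).squeeze(-1)
    loss = loss_fn(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
```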