2021
DOI: 10.1007/978-3-030-83527-9_46

A Multimodal Model for Predicting Conversational Feedbacks

Abstract: We propose in this paper a statistical model for predicting a listener's feedbacks in a conversation. The first contribution of the paper is a study of the prediction of all feedbacks, including those in overlap with the speaker, with good accuracy. Existing models are good at predicting feedbacks during a pause, but reach a very low success level for all feedbacks. We take in this paper a first step towards this complex problem. The second contribution is a model predicting precisely the type …

Cited by 15 publications (13 citation statements)
References: 24 publications
“…Visual features. We included features that have been used in previous work on BC such as head movement (nodding and shaking) [2,5,21,28], gaze [17,21,38], eyebrow movements (raising and frowning) [21,22], and facial expressions (smiling and laughing) [2,21]. These features were annotated manually and were available with the dataset (for details, see [3]).…”
Section: Models (mentioning), confidence: 99%
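
The visual cues listed in this statement are categorical, manually annotated events. Below is a minimal Python sketch, not code from the cited paper, of how such annotations could be encoded as a binary feature vector per time window; the cue labels, the windowing notion, and the helper name encode_window are illustrative assumptions.

```python
# Minimal illustrative sketch, not code from the cited work: encode manually
# annotated visual cues (head movement, gaze, eyebrows, facial expressions)
# as a binary feature vector for one time window. Labels are hypothetical.

VISUAL_CUES = [
    "head_nod", "head_shake",          # head movement
    "gaze_at_partner",                 # gaze
    "eyebrow_raise", "eyebrow_frown",  # eyebrow movements
    "smile", "laugh",                  # facial expressions
]

def encode_window(active_cues):
    """Map the set of cue labels active in a window to a 0/1 vector."""
    return [1 if cue in active_cues else 0 for cue in VISUAL_CUES]

# Example: the listener nods while gazing at the speaker.
print(encode_window({"head_nod", "gaze_at_partner"}))  # [1, 0, 1, 0, 0, 0, 0]
```
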
“…We selected a subset of the eGeMAPS features that were used in several previous studies. We included pitch (variation) [5,21,22,27,28,40], Mel-Frequency Cepstral Coefficients (MFCC) [17,21,28,34], voice quality [17,21,28,34], energy [18,31,34] and pausal information [5,6,21,32]. To minimize identity-confounding [29], the features were centered and scaled for each participant.…”
Section: Models (mentioning), confidence: 99%
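
The per-participant centering and scaling mentioned in the statement above can be sketched as follows; this is an assumption-laden illustration, not the cited pipeline. It presumes the eGeMAPS subset has already been extracted (e.g. with openSMILE) into a pandas DataFrame with a participant column, and the feature column names below are placeholders.

```python
import pandas as pd

# Placeholder names standing in for the selected eGeMAPS features
# (pitch variation, MFCCs, voice quality, energy, pausal information).
FEATURES = ["pitch_sd", "mfcc1_mean", "jitter_mean", "loudness_mean", "pause_ratio"]

def standardize_per_participant(df: pd.DataFrame) -> pd.DataFrame:
    """Center and scale each feature within each participant, removing
    speaker-level offsets and thus reducing identity-confounding."""
    out = df.copy()
    grouped = out.groupby("participant")[FEATURES]
    out[FEATURES] = (out[FEATURES] - grouped.transform("mean")) / grouped.transform("std")
    return out
```
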
“…Most common multimodal settings combined landmarks, body/head pose, or visual cues with past utterance transcriptions (Chu et al., 2018; Hua et al., 2019; Ueno et al., 2020), acoustic features (Türker et al., 2018; Ahuja et al., 2019; Ueno et al., 2020; Goswami et al., 2020; Woo et al., 2021; Jain and Leekha, 2021; Murray et al., 2021; Ben-Youssef et al., 2021), speaker's metadata (Raman et al., 2021), or with combinations of the previous modalities (Ishii et al., 2020; Huang et al., 2020; Blache et al., 2020; Ishii et al., 2021; Boudin et al., 2021). The most common way to exploit different modalities together consists in simply concatenating their embedded representations.…”
Section: Input Modalities (mentioning), confidence: 99%
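
The concatenation strategy described in this statement can be illustrated with a short PyTorch sketch; the embedding sizes, the two-class output, and the module name LateFusionConcat are hypothetical choices, not a model from any of the cited works.

```python
import torch
import torch.nn as nn

class LateFusionConcat(nn.Module):
    """Fuse per-modality embeddings by simple concatenation, then map the
    fused vector to prediction logits with a linear head."""

    def __init__(self, dims=(128, 64, 32), n_classes=2):
        super().__init__()
        self.head = nn.Linear(sum(dims), n_classes)

    def forward(self, visual_emb, acoustic_emb, lexical_emb):
        fused = torch.cat([visual_emb, acoustic_emb, lexical_emb], dim=-1)
        return self.head(fused)

# Example with a batch of 4 and the assumed embedding sizes.
model = LateFusionConcat()
logits = model(torch.randn(4, 128), torch.randn(4, 64), torch.randn(4, 32))
print(logits.shape)  # torch.Size([4, 2])
```
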
“…More recently, the collection, annotation and release of bigger datasets favored the appearance of data-driven automated multimodal methods for backchannel prediction. For example, Boudin et al. (2021) used a logistic classifier trained on visual cues, prosodic and lexico-syntactic features in order to predict not only the backchannel opportunity but also its associated subtype (generic, positive, or expected). The choice of such a simple classifier was driven by the small dataset available.…”
Section: High-level (mentioning), confidence: 99%
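
To make the setup described here concrete, the sketch below shows a scikit-learn logistic regression over a combined feature matrix, predicting a class among "none" and the three subtypes named in the statement; the random feature matrix, labels, and the extra "none" class are made-up placeholders, not the configuration actually used by Boudin et al. (2021).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# "none" = no backchannel opportunity; the other labels are the subtypes
# named in the statement above.
CLASSES = ["none", "generic", "positive", "expected"]

# Placeholder feature matrix: one row per candidate point, columns standing
# in for concatenated visual, prosodic and lexico-syntactic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.choice(CLASSES, size=200)

clf = LogisticRegression(max_iter=1000)  # handles the multiclass case via softmax
clf.fit(X, y)
print(clf.predict(X[:3]))
```
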