2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003816

Improving Speech-Based End-of-Turn Detection Via Cross-Modal Representation Learning with Punctuated Text Data

Cited by 5 publications (10 citation statements)
References 26 publications
“…We used high-level abstracted features extracted from acoustic, linguistic, and visual modalities. We plan to use other interpretable features, such as prosody (Ferrer et al., 2002; Holler and Kendrick, 2015; Hömke et al., 2017; Holler et al., 2018; Masumura et al., 2018, 2019; Roddy et al., 2018) and gaze behavior (Chen and Harper, 2009; Kawahara et al., 2012; Jokinen et al., 2013; Ishii et al., 2015a, 2016a), and to implement more complex predictive models (Masumura et al., 2018, 2019; Roddy et al., 2018; Ward et al., 2018) that take into account temporal dependencies. Hara et al. (2018) proposed a predictive model that can predict backchannels and fillers in addition to turn-changing using multi-task learning.…”
Section: Future Work
confidence: 99%
“…As a result of previous research on conversation turns and behaviors, many studies have developed models for predicting actual turn-changing, i.e., whether turn-changing or turn-keeping will take place, on the basis of acoustic features (Ferrer et al., 2002; Schlangen, 2006; Chen and Harper, 2009; de Kok and Heylen, 2009; Huang et al., 2011; Laskowski et al., 2011; Eyben et al., 2013; Jokinen et al., 2013; Hara et al., 2018; Lala et al., 2018; Masumura et al., 2018, 2019; Roddy et al., 2018; Ward et al., 2018). They have used representative acoustic features from the speaker's speech, such as log-mel and mel-frequency cepstral coefficients (MFCCs), as feature values.…”
Section: Related Work
confidence: 99%
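The statement above describes the standard acoustic approach: extract frame-level features such as MFCCs from the speaker's speech near the end of an utterance and classify whether the turn will change or be kept. As a minimal illustrative sketch only (not the method of the indexed paper or of any cited work; the 16 kHz sample rate, 1-second pooling window, 13 MFCC coefficients, logistic-regression classifier, and synthetic data are all assumptions made for illustration), such a predictor could be prototyped as follows:

import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

SR = 16000      # assumed sample rate (Hz)
N_MFCC = 13     # assumed number of MFCC coefficients

def utterance_features(waveform, sr=SR):
    """Mean-pool MFCCs over the final second of an utterance."""
    tail = waveform[-sr:]                                      # last 1 s of speech
    mfcc = librosa.feature.mfcc(y=tail, sr=sr, n_mfcc=N_MFCC)  # (N_MFCC, frames)
    return mfcc.mean(axis=1)                                   # (N_MFCC,) summary vector

# Toy stand-in data: random waveforms with labels 1 = turn-changing, 0 = turn-keeping.
rng = np.random.default_rng(0)
X = np.stack([utterance_features(rng.standard_normal(SR * 2)) for _ in range(40)])
y = rng.integers(0, 2, size=40)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted labels for first 5 utterances:", clf.predict(X[:5]))

Real systems in the cited literature replace the toy data with labeled utterance endings and typically use sequence models rather than a per-utterance linear classifier, but the feature-extraction step is analogous.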
“…With such knowledge, many studies have developed models for predicting actual turn-changing, i.e., whether turn-changing or turn-keeping will take place, on the basis of acoustic features [3, 6, 10, 12, 18, 26, 34, 36–38, 43, 47, 50], linguistic features [34, 37, 38, 43], and visual features, such as overall physical motion [3, 6, 8, 43], near the end of a speaker's utterances or during multiple utterances. Moreover, some research has focused on detailed non-verbal behaviors such as eye-gaze behavior [3, 6, 18, 20, 24, 26], head movement [18, 21, 22], mouth movement [23], and respiration [20, 25].…”
Section: Related Work 2.1 Turn-changing Prediction Technology
confidence: 99%
“…We used high-level abstracted features automatically extracted from acoustic, linguistic, and visual modalities. We plan to use other interpretable features, such as prosody [10, 15, 16, 19, 37, 38, 43] and gaze behavior [3, 20, 24, 26, 30], and to implement more complex prediction models [37, 38, 43, 50] that take into account temporal dependencies.…”
Section: Future Work
confidence: 99%