Evaluation of Real-time Deep Learning Turn-taking Models for Multiple Dialogue Scenarios

Lala, Divesh; Inoue, Koji; Kawahara, Tatsuya

doi:10.1145/3242969.3242994

Cited by 27 publications

(25 citation statements)

References 18 publications


“…We experimentally evaluated the proposed model on a Japanese conversation corpus (identical as that previously used [10]) that consists of four types of conversations and over 30,000 utterances. The type and the number of the sessions are shown in Table 1.…”

Section: Discussion (mentioning)

confidence: 99%

“…Such hard-coded models are difficult to transfer to other languages/cultures since they are culture-dependent. Data-driven methods such as finite state machine-based [7] and neural network-based models [8,9,10] have also been proposed in recent years. These works use feature sequences extracted from both text and speech signals.…”

Section: Introduction (mentioning)

confidence: 99%

“…In the proposed model, the lexical information is processed by a capsule network [19] with a convolutional layer, where the acoustic information is handled by a dilated convolutional network with ResNets [17] as its building blocks. We experimentally compared our proposed model with two RNN-based models (nested [8] and stacked [10]) on a Japanese conversational corpus that included four types of conversations: dating, job interviews, attentive listening, and at reception-counter conversations. Our experimental results show that the proposed non-RNN model outperformed RNN-based networks in a turn-taking estimation task.…”

Section: Introduction (mentioning)

confidence: 99%


A Neural Turn-Taking Model without RNN

Liu, Ishi, Ishiguro (2019). Interspeech 2019.

Sequential data such as speech and dialogs are usually modeled by Recurrent Neural Networks (RNN) and derivatives since the information can travel through time with such architecture. However, RNNs have disadvantages, including limited network depth and a GPU-unfriendly training process. Estimating the timing of turn-taking is a critical feature of dialog systems. Such tasks require knowledge about past dialog contexts and have been modeled using RNNs in several studies. In this paper, we propose a non-RNN model for the timing estimation of turn-taking in dialogs. The proposed model takes lexical and acoustic features as its input to predict a turn's end. We conducted experiments on four types of Japanese conversation datasets and show that with proper neural network designs, the long-term information in a dialog could propagate without a recurrent structure. The proposed model outperformed canonical RNN-based architectures on a turn-taking estimation task.


“…This model detects TRP at the end of IPU, based on prosodic and linguistic information of the preceding utterance. We used a hierarchical model of LSTM where each kind of feature is modeled by an individual LSTM and the outputs of those LSTMs are concatenated and fed into a linear layer that outputs the posterior probability of the output label [29], as shown in Figure 2. The reference labels are binary corresponding to the TRP labels annotated in Section 3.…”

Section: TRP Detection (mentioning)

confidence: 99%

“…The prediction model was based on conditional random field [16], support vector machines [24], and neural networks [25]. A recent approach is to use recurrent neural networks such as long short-term memory (LSTM), which can handle long-range context of the input sequence, and it achieved higher accuracy than conventional methods [15,26,19,27,28,29,30]. However, the performance is still low in natural conversations.…”

Section: Introduction (mentioning)

confidence: 99%

Turn-Taking Prediction Based on Detection of Transition Relevance Place

Hara, Inoue, Takanashi et al. (2019). Interspeech 2019. Self-citation.

We address turn-taking prediction in which spoken dialogue systems predict when to take the conversational floor. In natural conversations, many turn-taking decisions are arbitrary and subjective. In this study, we propose taking into account the concept of the transition relevance place (TRP) for turn-taking prediction. TRP is defined as a timing when the current speaking turn can be completed and other participants are able to take the turn. We conducted annotation of TRP on a human-robot dialogue corpus, ensuring the objectivity of this annotation among annotators. The proposed turn-taking prediction model adopts a two-step approach that detects TRP at first and then predicts a turn-taking event if TRP is detected. Experimental evaluations demonstrate that the proposed model improves the accuracy of turn-taking prediction by incorporating TRP detection.
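The two-step approach described in this abstract can be sketched roughly as follows. This is only an illustration of the gating idea, not the authors' implementation: the function name `two_step_decision`, the `Decision` type, the two input scores, and both threshold values are hypothetical, since the abstract does not state a decision rule or any numeric settings.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the abstract does not specify any values.
TRP_THRESHOLD = 0.5
TURN_THRESHOLD = 0.5


@dataclass
class Decision:
    trp_detected: bool  # step 1: was a transition relevance place detected?
    take_turn: bool     # step 2: should the system take the floor?


def two_step_decision(p_trp: float, p_turn: float,
                      trp_threshold: float = TRP_THRESHOLD,
                      turn_threshold: float = TURN_THRESHOLD) -> Decision:
    """Gate the turn-taking prediction on TRP detection.

    p_trp and p_turn are assumed to be scores in [0, 1] produced by two
    separate (hypothetical) models: a TRP detector and a turn-taking predictor.
    """
    trp_detected = p_trp >= trp_threshold
    if not trp_detected:
        # No TRP detected: the turn-taking predictor is not consulted.
        return Decision(trp_detected=False, take_turn=False)
    # TRP detected: only now is the turn-taking score considered.
    return Decision(trp_detected=True, take_turn=p_turn >= turn_threshold)
```

Under these assumptions, a high turn-taking score is ignored whenever the TRP score falls below its threshold, which is the point of running TRP detection first.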


Spoken Dialogue Technology for Semi-Autonomous Cybernetic Avatars

Kawahara, Saruwatari, Higashinaka et al. (2024). Cybernetic Avatar.

Speech technology has made significant advances with the introduction of deep learning and large datasets, enabling automatic speech recognition and synthesis at a practical level. Dialogue systems and conversational AI have also achieved dramatic advances based on the development of large language models. However, the application of these technologies to humanoid robots remains challenging because such robots must operate in real time and in the real world. This chapter reviews the current status and challenges of spoken dialogue technology for communicative robots and virtual agents. Additionally, we present a novel framework for the semi-autonomous cybernetic avatars investigated in this study.
