ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683843
Large Context End-to-end Automatic Speech Recognition via Extension of Hierarchical Recurrent Encoder-decoder Models

Cited by 28 publications (19 citation statements)
References 17 publications
“…Large-context encoder-decoder models: Large-context encoder-decoder models that can capture long-range linguistic contexts beyond sentence boundaries or utterance boundaries have received significant attention in E2E-ASR [7,8], machine translation [14,15], and some natural language generation tasks [16,17]. In recent studies, transformer-based large-context encoder-decoder models have been introduced in machine translation [18,19].…”
Section: Related Work
confidence: 99%
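
To make the hierarchical encoder-decoder idea in the excerpt above concrete, the following is a minimal sketch, assuming PyTorch: an utterance-level encoder summarizes each utterance, and a session-level RNN carries that summary across utterance boundaries so a decoder could condition on context beyond the current utterance. All module and variable names (HierarchicalEncoder, utt_enc, sess_rnn) are illustrative assumptions, not taken from the cited paper.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, feat_dim=80, utt_hid=256, ctx_hid=256):
        super().__init__()
        # Utterance-level encoder: summarizes one utterance at a time.
        self.utt_enc = nn.GRU(feat_dim, utt_hid, batch_first=True)
        # Session-level RNN: carries a context vector across utterance boundaries.
        self.sess_rnn = nn.GRUCell(utt_hid, ctx_hid)

    def forward(self, utterances):
        """utterances: list of (1, T_i, feat_dim) feature tensors from one session."""
        ctx = torch.zeros(1, self.sess_rnn.hidden_size)
        outputs = []
        for feats in utterances:
            enc_out, h_n = self.utt_enc(feats)   # encode the current utterance
            ctx = self.sess_rnn(h_n[-1], ctx)    # fold its summary into the session context
            # A decoder (omitted here) would attend over enc_out and also condition
            # on ctx, i.e. on information beyond the current utterance boundary.
            outputs.append((enc_out, ctx))
        return outputs

if __name__ == "__main__":
    session = [torch.randn(1, t, 80) for t in (120, 95, 140)]  # three utterances
    for enc_out, ctx in HierarchicalEncoder()(session):
        print(enc_out.shape, ctx.shape)
```
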
“…We compared our proposed hierarchical transformer-based large-context E2E-ASR model with an RNN-based utterance-level E2E-ASR model [3], transformer-based utterance-level E2E-ASR model [6], and hierarchical RNN-based large-context E2E-ASR model [8].…”
Section: Setups
confidence: 99%
“…Alternatively, global and local topic vectors, and neural-based cache models were integrated into LMs [17][18][19]. More recently, an extra neural network component, such as a hierarchical RNN or a pretrained LM [20], was used to encode the cross-utterance information into a vector representation for LM adaptation [21][22][23]. On the other hand, improvements in cross-utterance TLMs were mainly from efficient extension of attention spans, such as using segment-wise recurrence between two adjacent segments [11], adopting adaptive attention spans, or applying specially-designed masks to cope with much longer input sequences [24,25].…”
Section: Introduction
confidence: 99%
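
The segment-wise recurrence mentioned in the excerpt above can be sketched as follows. This is a minimal illustration assuming PyTorch, not the implementation of any cited system, and it omits the relative position handling used in practice: hidden states from the previous segment are cached, detached from the gradient graph, and prepended as extra keys and values so attention can reach across segment boundaries.

```python
import torch
import torch.nn as nn

class SegmentRecurrentAttention(nn.Module):
    """One self-attention layer with a cache of the previous segment's states."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.memory = None  # hidden states cached from the previous segment

    def forward(self, x):
        """x: (batch, seg_len, d_model) hidden states of the current segment."""
        # Prepend the cached segment so keys/values span two adjacent segments.
        kv = x if self.memory is None else torch.cat([self.memory, x], dim=1)
        out, _ = self.attn(query=x, key=kv, value=kv)
        # Cache the current segment without letting gradients flow across segments.
        self.memory = x.detach()
        return out

if __name__ == "__main__":
    layer = SegmentRecurrentAttention()
    seg1, seg2 = torch.randn(2, 32, 256), torch.randn(2, 32, 256)
    print(layer(seg1).shape)  # attends over segment 1 only
    print(layer(seg2).shape)  # attends over segments 1 and 2
```

Detaching the cache is what keeps training cost per segment constant while still extending the effective attention span, which is the trade-off the excerpt attributes to these cross-utterance TLM variants.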