Interspeech 2018
DOI: 10.21437/interspeech.2018-1392

Cold Fusion: Training Seq2Seq Models Together with Language Models

Abstract: Sequence-to-sequence (Seq2Seq) models with attention have excelled at tasks which involve generating natural language sentences such as machine translation, image captioning and speech recognition. Performance has further been improved by leveraging unlabeled data, often in the form of a language model. In this work, we present the Cold Fusion method, which leverages a pre-trained language model during training, and show its effectiveness on the speech recognition task. We show that Seq2Seq models with Cold Fusion…

Cited by 220 publications (170 citation statements). References: 17 publications.
Citation statements (ordered by relevance):
“…These models are usually trained with character-based units and decoded with a basic beam search. There have been extensive efforts to develop decoding algorithms that can use external LMs, so-called fusion methods [27,28,29,30,31]. However, these methods have shown relatively small gains on large-scale ASR tasks [32].…”
Section: Introduction (mentioning, confidence: 99%)
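For context, the most widely used of these fusion methods, shallow fusion, interpolates the Seq2Seq and external LM scores log-linearly during beam search. A sketch of the scoring rule, with an assumed, tuned interpolation weight lambda, is:

    \hat{y} = \arg\max_{y} \left( \log P_{\mathrm{S2S}}(y \mid x) + \lambda \, \log P_{\mathrm{LM}}(y) \right)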
“…For this specific task on a medium-sized corpus, the hybrid approach yields significantly better results. To achieve better performance with the bLSTMs, their output needs to be combined with LM-based prefix beam search, or the syllable network needs to be trained along with an LM as proposed in [23].…”
Section: Discussion (mentioning, confidence: 99%)
“…Cold Fusion (Sriram et al, 2017) deals with this problem by training the sequence-to-sequence model along with the gating mechanism, thus making the model aware of the pre-trained language model throughout the training process. The decoder does not need to learn a language model from scratch, and can thus learn more task-specific language characteristics which are not captured by the pre-trained language model (which has been trained on a much larger, domain-agnostic corpus).…”
Section: Fusion Methods (mentioning, confidence: 99%)
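As a concrete illustration of the gating mechanism described in the excerpt above, here is a minimal sketch of a Cold Fusion layer in PyTorch. The class name, the layer sizes, and the choice of feeding the frozen language model's logits directly are illustrative assumptions, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class ColdFusionLayer(nn.Module):
        """Sketch of Cold Fusion gating (after Sriram et al., 2017).

        Fuses the decoder state s_t with the logits of a frozen,
        pre-trained LM through a fine-grained (elementwise) gate.
        All dimensions are illustrative assumptions.
        """

        def __init__(self, dec_dim, lm_vocab, fused_dim, out_vocab):
            super().__init__()
            # Project the LM logits into a hidden representation h_t^LM.
            self.lm_proj = nn.Sequential(nn.Linear(lm_vocab, fused_dim), nn.ReLU())
            # Fine-grained gate computed from [s_t; h_t^LM].
            self.gate = nn.Linear(dec_dim + fused_dim, fused_dim)
            # Project the fused state to output-vocabulary logits.
            self.out = nn.Sequential(
                nn.Linear(dec_dim + fused_dim, fused_dim), nn.ReLU(),
                nn.Linear(fused_dim, out_vocab),
            )

        def forward(self, s_t, lm_logits):
            h_lm = self.lm_proj(lm_logits)                       # h_t^LM
            g_t = torch.sigmoid(self.gate(torch.cat([s_t, h_lm], dim=-1)))
            fused = torch.cat([s_t, g_t * h_lm], dim=-1)         # gated fused state
            return self.out(fused)                               # output logits

A decoder would call this once per step, e.g. logits = cold_fusion(s_t, lm(y_prev)), training the gate and projections jointly with the Seq2Seq model while keeping the language model frozen.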
“…The output of the DM is similarly concatenated to the input of the linear layer between the encoder and the decoder of the higher-level model. The output of the NLG, in the form of logits at a decoding time-step, is combined with the hidden state of the decoder via cold-fusion (Sriram et al., 2017). Given the NLG output as l_t^NLG and the higher-level decoder hidden state as s_t, the cold-fusion method is described as follows:…”
Section: Structured Fusion Network (mentioning, confidence: 99%)
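The equations themselves are cut off in the extracted statement. As a hedged reconstruction following the original Cold Fusion formulation, with the excerpt's l_t^NLG standing in for the language-model logits, they take roughly this form:

    h_t^{LM} = \mathrm{DNN}\big(l_t^{NLG}\big)
    g_t = \sigma\big(W\,[s_t ; h_t^{LM}] + b\big)
    s_t^{CF} = \big[s_t ;\; g_t \circ h_t^{LM}\big]
    r_t^{CF} = \mathrm{DNN}\big(s_t^{CF}\big)
    \hat{P}(y_t \mid y_{<t}) = \mathrm{softmax}\big(r_t^{CF}\big)

Here DNN denotes a small feed-forward transformation, sigma the elementwise sigmoid producing a fine-grained gate, and the circle elementwise multiplication.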