Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition

Cho, Jaejin; Watanabe, Shinji; Hori, Takaaki; Baskar, Murali Karthick; Inaguma, Hirofumi; Villalba, Jesús; Dehak, Najim

doi:10.1109/icassp.2019.8683380

Cited by 5 publications

(6 citation statements)

References 30 publications

“…For all augmentation schemes, we sweep across the rate of dropout in range [0.0, 0.7] to find the optimal level of total regularization. For the baseline, the best result comes from setting it to 0.7, those trained with data augmentation were fairly robust to the dropout rate and achieved their best performance in range of 0.3-0.6.…”

Section: Tested Augmentation Schemes

mentioning

confidence: 99%

“…The traditional reason language models (LMs) appear in ASR systems is that they directly represent the prior term P(S) in the Bayes factorization of the posterior probability P(S|A) of a sentence S given the audio A. However in practice, LMs trained on excessive amounts of data are combined with hybrid and end-to-end systems alike [1,2,3] at authors liberty. Overall, LMs can be seen as a refinement tool to apply on a preliminary result of recognition.…”

Section: Introductionmentioning

confidence: 99%
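
For reference, the Bayes factorization mentioned in the statement above can be written out as follows. The symbols S and A follow the quoted text; the acoustic likelihood P(A|S), the evidence P(A), and the decoding rule \hat{S} are standard notation added here as an assumption, not taken from the quoted paper.

```latex
P(S \mid A) = \frac{P(A \mid S)\, P(S)}{P(A)}, \qquad
\hat{S} = \arg\max_{S} \; P(A \mid S)\, P(S)
```

Since P(A) does not depend on S, it drops out of the maximization, which is why the language model enters decoding only through the prior term P(S).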

Text Augmentation for Language Models in High Error Recognition Scenario

Beneš,

Burget

2021

Interspeech 2021

We examine the effect of data augmentation for training of language models for speech recognition. We compare augmentation based on global error statistics with one based on per-word unigram statistics of ASR errors and observe that it is better to pay attention only to the global substitution, deletion and insertion rates. This simple scheme also performs consistently better than label smoothing and its sampled variants. Additionally, we investigate the behavior of perplexity estimated on augmented data, but conclude that it gives no better prediction of the final error rate. Our best augmentation scheme increases the WER improvement from second-pass rescoring from 1.1 % to 1.9 % absolute on the CHiME-6 challenge.

“…Cho et al [9] presents a technique (Cell Control Fusion) that is similar to [8], but differs in the aspect of not just fusing the gated outputs of hidden states of the external language model (RNN) but also for the cell states. Hence, a LSTM flavor of the RNN is used as a sequence-to-sequence model in this technique.…”

Section: Sriram et al. [8] Presents a Technique (Cold Fusion

mentioning

confidence: 99%

A Survey on Knowledge integration techniques with Artificial Neural Networks for seq-2-seq/time series models

Vadiraja,

Chattha

2020

Preprint

In recent years, with the advent of massive computational power and the availability of huge amounts of data, deep neural networks have enabled the exploration of uncharted areas in several domains. But at times, they under-perform due to insufficient data, poor data quality, data that might not be covering the domain broadly, etc. Knowledge-based systems leverage expert knowledge for making decisions and suitably take actions. Such systems retain interpretability in the decision-making process. This paper focuses on exploring techniques to integrate expert knowledge into deep neural networks for sequence-to-sequence and time series models to improve their performance and interpretability.

“…Unimodal and multimodal model fusion has been explored extensively in the context of ASR [29,7], Neural Machine Translation (NMT) [12], and hierarchical story generation [11]. However, to the best of our knowledge, there has been no similar works for visual captioning.…”

Section: Fusion Techniques and Variations

mentioning

confidence: 99%

Fusion Models for Improved Visual Captioning

Kalimuthu,

Mogadala,

Mosbach

et al. 2020

Preprint

Visual captioning aims to generate textual descriptions given images. Traditionally, the captioning models are trained on human-annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models and makes them prone to mistakes. Language models can, however, be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders [10] and coherent text generators [4]. Meanwhile, several unimodal and multimodal fusion techniques have been proven to work well for natural language generation [11] and automatic speech recognition [29]. Building on these recent developments, and with an aim of improving the quality of generated captions, the contribution of our work in this paper is two-fold: First, we propose a generic multimodal model fusion framework for caption generation as well as emendation where we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within the traditional encoder-decoder visual captioning frameworks. Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MS-COCO, show improvements over the baseline, indicating the usefulness of our proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis on the emended captions and identify error categories based on the type of corrections.
