Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)
DOI: 10.18653/v1/n19-2024
Neural Text Normalization with Subword Units

Abstract: Text normalization (TN) is an important step in conversational systems. It converts written text to its spoken form to facilitate speech recognition, natural language understanding and text-to-speech synthesis. Finite state transducers (FSTs) are commonly used to build grammars that handle text normalization (Sproat, 1996; Roark et al., 2012). However, translating linguistic knowledge into grammars requires extensive effort. In this paper, we frame TN as a machine translation task and tackle it with sequence-t…
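To make the written-to-spoken conversion concrete, here is a minimal rule-based sketch of the kind of hand-written mapping that FST grammars encode. The rules, the abbreviation table, and the digit-by-digit reading are illustrative assumptions, not the paper's actual grammar.

```python
import re

# Illustrative rule table: the written forms and expansions below are toy
# examples of what an FST-based normalization grammar would encode.
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "%": "percent"}

def spell_digits(token: str) -> str:
    """Read a digit string out digit by digit, e.g. '911' -> 'nine one one'."""
    return " ".join(ONES[int(d)] for d in token)

def normalize(text: str) -> str:
    """Convert written text to a spoken form, one whitespace token at a time."""
    out = []
    for token in text.split():
        low = token.lower()
        if low in ABBREVIATIONS:
            out.append(ABBREVIATIONS[low])
        elif re.fullmatch(r"\d+", token):
            out.append(spell_digits(token))
        else:
            out.append(token)
    return " ".join(out)
```

For example, `normalize("call 911")` yields "call nine one one". The long tail of cases (dates, currency, context-dependent readings such as "911" as "nine eleven") is exactly the effort the abstract says grammar writing requires, and what the paper's sequence-to-sequence framing is meant to learn from data instead.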

Cited by 39 publications (28 citation statements)
References 11 publications
“…Truecasing, as a part of text normalization, is peculiar in that its bulk can be solved simply by a few hand-written rules, with however a long tail of very difficult cases such as acronyms, unseen words. Finding a proper balance between the flexibility of neural approaches, and the controlled, more interpretable behaviour of FST-based systems, remains an open and challenging problem (Mansfield et al (2019), Sproat and Jaitly (2016), Zhang et al (2019)).…”
Section: Discussion (mentioning)
confidence: 99%
“…It is common to model the text normalization problem as a Machine Translation problem (Mansfield et al., 2019; Lusetti et al., 2018; Filip et al., 2006; Zhang et al., 2019). Given that Bidirectional LSTM with attention is a popular baseline model for the machine translation task, we built a text normalization model using the same on the lines of work by Bahdanau et al. (2014).…”
Section: Benchmark Baseline (mentioning)
confidence: 99%
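The Bahdanau-style additive attention that the baseline above relies on can be sketched in a few lines. The scalar weights and two-dimensional states below are toy values for illustration, not the cited model's trained parameters.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def additive_attention(decoder_state, encoder_states, w_dec, w_enc, v):
    """Bahdanau-style scoring: score(s, h) = v . tanh(w_dec*s + w_enc*h).

    w_dec and w_enc are scalars here (matrices in the real model) so the
    alignment mechanism stays readable; returns normalized attention weights.
    """
    scores = []
    for h in encoder_states:
        hidden = [math.tanh(w_dec * s_i + w_enc * h_i)
                  for s_i, h_i in zip(decoder_state, h)]
        scores.append(sum(v_i * h_i for v_i, h_i in zip(v, hidden)))
    return softmax(scores)
```

Each decoder step uses these weights to form a context vector as the weighted sum of encoder states, which is what lets the model attend to the written-form token currently being normalized.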
“…There is still not much work done in the area of context-aware normalisers. Mansfield et al. (2019) proposed to use sequence-to-sequence models to normalise full sentences for conversational systems. Jurish (2010) proposed to use hidden Markov models to choose over the normalised candidates in a sentential context.…”
Section: Related Work (mentioning)
confidence: 99%
“…This can be captured systematically by machine learning algorithms and applied to unseen words. Thus, the current state-of-the-art approaches to the historical normalisation rely on statistical or neural machine translation methods and define the task as a problem of translating between characters or substrings (Mansfield et al, 2019) instead of words.…”
Section: Introduction (mentioning)
confidence: 99%
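Recasting normalisation as translation between characters, as the statement above describes, amounts to turning each (written, spoken) pair into source and target symbol sequences for a translation model. The boundary markers `<s>`, `</s>`, and the space symbol `<sp>` below are illustrative conventions, not the cited papers' exact vocabulary.

```python
def to_char_seq(text: str) -> list:
    """Decompose a string into a character sequence with boundary markers,
    mapping literal spaces to an explicit <sp> symbol."""
    return ["<s>"] + ["<sp>" if c == " " else c for c in text] + ["</s>"]

def make_training_pair(written: str, spoken: str):
    """Build one source/target example for a character-level translation model."""
    return to_char_seq(written), to_char_seq(spoken)

src, tgt = make_training_pair("3 kg", "three kilograms")
```

Because the model sees characters (or learned substrings) rather than whole words, regular spelling changes generalise to unseen words, which is the advantage the quoted passage attributes to this framing.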