Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 2014
DOI: 10.3115/v1/p14-2111

Normalizing tweets with edit scripts and recurrent neural embeddings

Abstract: Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network…
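The abstract's character-level embeddings come from a simple recurrent (Elman) network trained to predict the next character. A minimal sketch of such a network's forward pass in plain NumPy follows; all names and dimensions here are illustrative, not the paper's implementation:

```python
import numpy as np

class ElmanRNN:
    """Toy Elman-style simple recurrent network over characters.

    Hidden states after reading each character can serve as
    character-level text embeddings (illustrative sketch only).
    """

    def __init__(self, n_chars, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.Wxh = rng.normal(0.0, 0.1, (n_hidden, n_chars))   # input -> hidden
        self.Whh = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden recurrence
        self.Why = rng.normal(0.0, 0.1, (n_chars, n_hidden))   # hidden -> next-char logits

    def embed(self, char_ids):
        """Return one hidden-state vector per input character."""
        h = np.zeros(self.Whh.shape[0])
        states = []
        for i in char_ids:
            x = np.zeros(self.Wxh.shape[1])
            x[i] = 1.0                                         # one-hot input character
            h = 1.0 / (1.0 + np.exp(-(self.Wxh @ x + self.Whh @ h)))  # sigmoid update
            states.append(h.copy())
        return np.stack(states)

    def next_char_probs(self, h):
        """Softmax distribution over the next character given a hidden state."""
        z = self.Why @ h
        e = np.exp(z - z.max())
        return e / e.sum()
```

Training such a network would adjust the weights by backpropagation through time on a next-character prediction objective; only the resulting hidden states are then reused as features.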


Cited by 73 publications (71 citation statements)
References 11 publications
“…The above setup, features and edit operations are identical to Chrupała (2014) to the best of our knowledge. We further add a character class feature {NULL, control, space, apostrophe, punctuation, digit, quote, bracket, lowercase letter, uppercase letter, non-ASCII, other} and a feature indicating whether the character is part of a token that is eligible for editing according to the shared task.…”
[1] https://bitbucket.org/gchrupala/elman
[2] http://rnnlm.org/
[3] https://bitbucket.org/gchrupala/codeswitch/overview
[4] More precisely, we process UTF-8 bytes.
Section: Feature Extraction
confidence: 99%
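The character-class feature quoted above amounts to a small ordered lookup. The exact membership of each class is not specified in the statement, so the rules below are assumptions:

```python
import string

def char_class(ch):
    """Map a character to one of the twelve classes listed in the quote.

    Ordered tests: more specific classes (apostrophe, quote, bracket)
    must fire before the generic punctuation class.
    """
    if ch is None:
        return "NULL"
    if ord(ch) > 127:
        return "non-ASCII"
    if ord(ch) < 32 or ord(ch) == 127:
        return "control"
    if ch == " ":
        return "space"
    if ch == "'":
        return "apostrophe"
    if ch == '"':
        return "quote"
    if ch in "()[]{}<>":
        return "bracket"
    if ch.isdigit():
        return "digit"
    if ch in string.punctuation:
        return "punctuation"
    if ch.islower():
        return "lowercase letter"
    if ch.isupper():
        return "uppercase letter"
    return "other"
```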
“…We use the off-the-shelf model from Chrupała (2014) [3]. The input is the characters of the tweet [4] in one-hot encoding.…”
Section: Feature Extraction
confidence: 99%
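Since the footnote clarifies that UTF-8 bytes rather than characters are processed, one-hot encoding a tweet reduces to one 256-dimensional indicator vector per byte. A sketch (the function name is ours):

```python
def one_hot_bytes(text):
    """Encode a string as one 256-dimensional one-hot vector per UTF-8 byte."""
    rows = []
    for b in text.encode("utf-8"):
        v = [0.0] * 256   # one slot per possible byte value
        v[b] = 1.0
        rows.append(v)
    return rows
```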
“…al. 2010; Liu et al., 2011; Han et al., 2013; Bali, 2013; Chrupała, 2014) and longer UGC texts, such as reviews and blogs, have much in common, but the differences are sufficiently significant to justify addressing them separately.…”
Section: Related Work
confidence: 99%
“…Neural Network (Elman): We extract features from the hidden layer of a recurrent neural network that has been trained to predict the next character in a string (Chrupała, 2014). The 10 most active units of the hidden layer for each of the initial 4 bytes and final 4 bytes of each token are binarised by using a threshold of 0.5.…”
[Table 5: Average cross-validation accuracy of 6-way SVMs of combinations of GDLC, k-NN, Elman and P1N1 features for Nepali-English]
Section: Neural Network (Elman) and k-NN Features
confidence: 99%
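The feature extraction described in this statement (top-10 hidden units at each of the first 4 and last 4 byte positions, binarised at 0.5) might be sketched roughly as follows; the function name and the (position, unit) feature encoding are our assumptions, not the cited paper's:

```python
def elman_features(states, k=10, threshold=0.5):
    """Binarise the k most active hidden units at the first 4 and
    last 4 byte positions of a token.

    states: one list of hidden-layer activations per byte of the token.
    Returns a set of (position, unit) indicator features whose
    activation exceeds the threshold.
    """
    # Positions 0-3: initial bytes; positions 4-7: final bytes.
    positions = list(enumerate(states[:4])) + [
        (4 + j, h) for j, h in enumerate(states[-4:])
    ]
    feats = set()
    for pos, h in positions:
        # Indices of the k most active units at this position.
        top = sorted(range(len(h)), key=h.__getitem__, reverse=True)[:k]
        for unit in top:
            if h[unit] > threshold:
                feats.add((pos, unit))
    return feats
```

For tokens shorter than 4 bytes this sketch lets the initial and final windows overlap; how the original work handles that edge case is not stated.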