Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 2014
DOI: 10.3115/v1/p14-2111

Normalizing tweets with edit scripts and recurrent neural embeddings

Abstract: Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network…
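The abstract's character-level embeddings come from a simple recurrent (Elman) network trained to predict the next character. A minimal sketch of such a network's forward pass in plain NumPy follows; all names and dimensions here are illustrative, not the paper's implementation:

```python
import numpy as np

class ElmanRNN:
    """Toy Elman-style simple recurrent network over characters.

    Hidden states after reading each character can serve as
    character-level text embeddings (illustrative sketch only).
    """

    def __init__(self, n_chars, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.Wxh = rng.normal(0.0, 0.1, (n_hidden, n_chars))   # input -> hidden
        self.Whh = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden recurrence
        self.Why = rng.normal(0.0, 0.1, (n_chars, n_hidden))   # hidden -> next-char logits

    def embed(self, char_ids):
        """Return one hidden-state vector per input character."""
        h = np.zeros(self.Whh.shape[0])
        states = []
        for i in char_ids:
            x = np.zeros(self.Wxh.shape[1])
            x[i] = 1.0                                         # one-hot input character
            h = 1.0 / (1.0 + np.exp(-(self.Wxh @ x + self.Whh @ h)))  # sigmoid update
            states.append(h.copy())
        return np.stack(states)

    def next_char_probs(self, h):
        """Softmax distribution over the next character given a hidden state."""
        z = self.Why @ h
        e = np.exp(z - z.max())
        return e / e.sum()
```

Training such a network would adjust the weights by backpropagation through time on a next-character prediction objective; only the resulting hidden states are then reused as features.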


Cited by 73 publications (71 citation statements)
References 11 publications
“…The above setup, features and edit operations are identical to Chrupała (2014) to the best of our knowledge. We further add a character class feature {NULL, control, space, apostrophe, punctuation, digit, quote, bracket, lowercase letter, uppercase letter, non-ASCII, other} and a feature indicating whether the character is part of a token that is eligible for editing according to the shared task.…”
[1] https://bitbucket.org/gchrupala/elman
[2] http://rnnlm.org/
[3] https://bitbucket.org/gchrupala/codeswitch/overview
[4] More precisely, we process UTF-8 bytes.
Section: Feature Extraction
confidence: 99%
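The character-class feature quoted above amounts to a small ordered lookup. The exact membership of each class is not specified in the statement, so the rules below are assumptions:

```python
import string

def char_class(ch):
    """Map a character to one of the twelve classes listed in the quote.

    Ordered tests: more specific classes (apostrophe, quote, bracket)
    must fire before the generic punctuation class.
    """
    if ch is None:
        return "NULL"
    if ord(ch) > 127:
        return "non-ASCII"
    if ord(ch) < 32 or ord(ch) == 127:
        return "control"
    if ch == " ":
        return "space"
    if ch == "'":
        return "apostrophe"
    if ch == '"':
        return "quote"
    if ch in "()[]{}<>":
        return "bracket"
    if ch.isdigit():
        return "digit"
    if ch in string.punctuation:
        return "punctuation"
    if ch.islower():
        return "lowercase letter"
    if ch.isupper():
        return "uppercase letter"
    return "other"
```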
“…We use the off-the-shelf model from Chrupała (2014) [3]. The input is the characters of the tweet [4] in one-hot encoding.…”
Section: Feature Extraction
confidence: 99%
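Since the footnote clarifies that UTF-8 bytes rather than characters are processed, one-hot encoding a tweet reduces to one 256-dimensional indicator vector per byte. A sketch (the function name is ours):

```python
def one_hot_bytes(text):
    """Encode a string as one 256-dimensional one-hot vector per UTF-8 byte."""
    rows = []
    for b in text.encode("utf-8"):
        v = [0.0] * 256   # one slot per possible byte value
        v[b] = 1.0
        rows.append(v)
    return rows
```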
“…al. 2010; Liu et al., 2011; Han et al., 2013; Bali, 2013; Chrupała, 2014) and longer UGC texts, such as reviews and blogs, have much in common, but the differences are sufficiently significant to justify addressing them separately.…”
Section: Related Work
confidence: 99%
“…Neural Network (Elman): We extract features from the hidden layer of a recurrent neural network that has been trained to predict the next character in a string (Chrupała, 2014). The 10 most active units of the hidden layer for each of the initial 4 bytes and final 4 bytes of each token are binarised by using a threshold of 0.5.…”
[Table 5: Average cross-validation accuracy of 6-way SVMs of combinations of GDLC, k-NN, Elman and P1N1 features for Nepali-English]
Section: Neural Network (Elman) and k-NN Features
confidence: 99%
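The feature extraction described in this statement (top-10 hidden units at each of the first 4 and last 4 byte positions, binarised at 0.5) might be sketched roughly as follows; the function name and the (position, unit) feature encoding are our assumptions, not the cited paper's:

```python
def elman_features(states, k=10, threshold=0.5):
    """Binarise the k most active hidden units at the first 4 and
    last 4 byte positions of a token.

    states: one list of hidden-layer activations per byte of the token.
    Returns a set of (position, unit) indicator features whose
    activation exceeds the threshold.
    """
    # Positions 0-3: initial bytes; positions 4-7: final bytes.
    positions = list(enumerate(states[:4])) + [
        (4 + j, h) for j, h in enumerate(states[-4:])
    ]
    feats = set()
    for pos, h in positions:
        # Indices of the k most active units at this position.
        top = sorted(range(len(h)), key=h.__getitem__, reverse=True)[:k]
        for unit in top:
            if h[unit] > threshold:
                feats.add((pos, unit))
    return feats
```

For tokens shorter than 4 bytes this sketch lets the initial and final windows overlap; how the original work handles that edge case is not stated.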