2020
DOI: 10.3390/app10196893

Towards Robust Word Embeddings for Noisy Texts

Abstract: Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts in the form of tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words, which are artificial words added to the model to strengthen the similarity between standard words and their noisy variants. Our new embeddings outperform baseline mode…
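The bridge-word idea from the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration only (the token format, the noise model, and the pair construction are not taken from the paper): for each standard word, an artificial bridge token is created, and skipgram-style training pairs tie both the standard form and its noisy variants to that shared token, pulling their embeddings together.

```python
import random


def noisy_variant(word, rng):
    """Create a simple noisy variant by deleting or replacing one character.

    Illustrative only; the paper's actual noise model may differ.
    """
    if len(word) < 3:
        return word
    i = rng.randrange(len(word))
    if rng.random() < 0.5:
        return word[:i] + word[i + 1:]                     # character deletion
    c = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return word[:i] + c + word[i + 1:]                     # character replacement


def bridge_pairs(word, rng, n_variants=2):
    """Emit (surface form, bridge token) training pairs.

    The standard word and each noisy variant share one artificial bridge
    token, so skipgram-style training pushes their vectors closer together.
    """
    bridge = f"<bridge:{word}>"
    pairs = [(word, bridge)]
    for _ in range(n_variants):
        pairs.append((noisy_variant(word, rng), bridge))
    return pairs


rng = random.Random(0)
print(bridge_pairs("hello", rng))
```

These pairs would then be fed to an ordinary skipgram trainer alongside the regular corpus pairs; the bridge token acts as a hub that links the standard spelling and its noisy forms.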

Cited by 6 publications (6 citation statements)
References 32 publications (45 reference statements)
“…Perturbed GLUE Benchmark: To further investigate the impact of lexical normalization tools on related NLP tasks, we consider 5 subtasks of the popular GLUE benchmark [52]. As the GLUE datasets are of high quality, we follow previous approaches [19,41] in randomly perturbing the words in the validation and test sets while keeping the training set fixed. We generate synthetic lexical errors at 20, 40, and 60% noise rates: we perturb a sentence with probability equal to the rate and then, in every word of that sentence, select 1-2 characters uniformly at random to delete or replace with another random character.…”
Section: Methods
confidence: 99%
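The perturbation procedure quoted above can be sketched in a few lines. This is a minimal illustration, not the cited authors' code; the function names and the replacement alphabet are assumptions.

```python
import random
import string


def perturb_word(word, rng):
    """Delete or replace 1-2 characters chosen uniformly at random."""
    chars = list(word)
    k = min(rng.randint(1, 2), len(chars))
    for idx in rng.sample(range(len(chars)), k):
        if rng.random() < 0.5:
            chars[idx] = ""                                   # deletion
        else:
            chars[idx] = rng.choice(string.ascii_lowercase)   # replacement
    return "".join(chars)


def perturb_sentence(sentence, noise_rate, rng):
    """With probability `noise_rate`, perturb every word in the sentence;
    otherwise return it unchanged (matching the quoted procedure)."""
    if rng.random() >= noise_rate:
        return sentence
    return " ".join(perturb_word(w, rng) for w in sentence.split())


rng = random.Random(42)
print(perturb_sentence("the quick brown fox", 0.6, rng))
```

Note that the noise rate gates whole sentences: a selected sentence has every word corrupted, while unselected sentences pass through untouched.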
“…More recent works have introduced unsupervised statistical models for text cleaning [4,15] or combined multiple heuristics to identify and normalize out-of-vocabulary words [21]. Another line of work explored learning robust word representations through end-to-end neural networks, rather than normalizing the data beforehand [19,31], or directly fine-tuning BERT models for the lexical normalization task [33]. A further group of works focuses on learning directly from subword-level information, where character sequences or subword pairs are used to learn the representation without any correction step [33].…”
Section: Background and Related Work, 2.1 Lexical Normalization
confidence: 99%
“…Overall, the inclusion of the interactive analysis module empowers knowledge engineers with a comprehensive set of visual tools to analyze semantic relationships within the data [28], [29]. This facilitates a more informed approach to synset editing, ultimately enhancing the effectiveness of the WSD system.…”
Section: Interactive Analysis for Informed Synset Editing
confidence: 99%
“…Moreover, they showed that replacing WordNet synsets with a small set of upper-ontology concepts can improve the accuracy of predicate identification. The performance of word embeddings can also be improved using the noisy-text approach [33]. This could be an interesting direction for future research on word embeddings.…”
Section: Supplementary Materials
confidence: 99%