Towards Robust Word Embeddings for Noisy Texts

Doval, Yerai; Vilares, Jesús; Gómez-Rodríguez, Carlos

doi:10.3390/app10196893

Cited by 6 publications (6 citation statements)

References 32 publications (45 reference statements)


“…Perturbed GLUE Benchmark: To further investigate the impact of lexical normalization tools over the related NLP tasks, we consider 5 subtasks of the popular GLUE benchmark [52]. As the GLUE datasets are of high-quality, we follow previous approaches [19,41] in randomly perturbing the words in the validation and testing dataset while keeping the training set fixed. We generate synthetic lexical errors at 20, 40, and 60% rates of noise such that we perturb a sentence with probability equal to this rate and then select 1-2 characters uniformly at random in every word of the sentence to delete or replace with another random character.…”

Section: Methods (mentioning)

confidence: 99%
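
The perturbation procedure described in the citation statement above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the citing authors' code: the function names `perturb_word` and `perturb_sentence`, whitespace tokenization, the even split between deletion and replacement, and the lowercase ASCII replacement alphabet are all hypothetical choices that the quoted text does not specify.

```python
import random
import string


def perturb_word(word, rng):
    """Delete or replace 1-2 characters of `word`, chosen uniformly at random."""
    chars = list(word)
    # Assumption: a one-character word can only lose or change one character.
    k = min(rng.randint(1, 2), len(chars))
    # Process positions right to left so a deletion never shifts a later index.
    for pos in sorted(rng.sample(range(len(chars)), k), reverse=True):
        if rng.random() < 0.5:  # assumption: delete and replace are equally likely
            del chars[pos]
        else:
            # Assumption: replacements come from lowercase ASCII letters only,
            # and always differ from the character being replaced.
            candidates = [c for c in string.ascii_lowercase if c != chars[pos]]
            chars[pos] = rng.choice(candidates)
    return "".join(chars)


def perturb_sentence(sentence, rate, rng):
    """With probability `rate`, perturb every word of the sentence."""
    if rng.random() >= rate:
        return sentence
    # Assumption: words are whitespace-separated tokens.
    return " ".join(perturb_word(w, rng) for w in sentence.split())


if __name__ == "__main__":
    rng = random.Random(0)
    for rate in (0.2, 0.4, 0.6):  # the noise rates named in the quoted statement
        print(rate, perturb_sentence("the quick brown fox", rate, rng))
```

Passing a seeded `random.Random` instance keeps the noise reproducible, which matters when the same perturbed validation and test sets have to be reused across models.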

“…More recent works have introduced unsupervised statistical models for text cleaning [4,15] or combining multiple heuristics to identify and normalize out of vocabulary words [21]. Another explored learning robust word representations through end-to-end neural networks as opposed to normalizing the data beforehand [19,31] or directly fine-tuning the BERT models for lexical normalization task [33]. Another group of works focus on directly learning over the subword level information, where character sequences or subword pairs are directly used for learning the representation without any correction steps [33].…”

Section: Background and Related Work, 2.1 Lexical Normalization (mentioning)

confidence: 99%


A Fast Randomized Algorithm for Massive Text Normalization

Jiang, Luo, Lakshman, et al. 2021 (Preprint)


Many popular machine learning techniques in natural language processing and data mining rely heavily on high-quality text sources. However real-world text datasets contain a significant amount of spelling errors and improperly punctuated variants where the performance of these models would quickly deteriorate. Moreover, real-world, web-scale datasets contain hundreds of millions or even billions of lines of text, where the existing text cleaning tools are prohibitively expensive to execute over and may require an overhead to learn the corrections. In this paper, we present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data. Our algorithm relies on the Jaccard similarity between words to suggest correction results. We efficiently handle the pairwise word-to-word comparisons via Locality Sensitive Hashing (LSH). We also propose a novel stabilization process to address the issue of hash collisions between dissimilar words, which is a consequence of the randomized nature of LSH and is exacerbated by the massive scale of real-world datasets. Compared with existing approaches, our method is more efficient, both asymptotically and in empirical evaluations, and does not rely on additional features, such as lexical/phonetic similarity or word embedding features. In addition, FLAN does not require any annotated data or supervised learning. We further theoretically show the robustness of our algorithm with upper bounds on the false positive and false negative rates of corrections. Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN. Leveraging recent advances in efficiently computing minhash signatures, FLAN requires much less computational time compared to baselines text normalization techniques on large-scale Twitter and Reddit datasets. 
In a human evaluation of normalization quality, FLAN achieves 5% and 14% improvements over the baselines on the Reddit and Twitter datasets, respectively. Our method also improves performance on the perturbed GLUE benchmark datasets, where we introduce errors into the text, and on Twitter sentiment classification applications.
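
The two ingredients the abstract names, Jaccard similarity between words and minhash signatures, can be illustrated with a minimal sketch. It is not FLAN itself: the abstract does not say how a word is turned into a set, so the padded character bigrams used here are an assumption, as are the helper names `char_ngrams`, `jaccard`, `minhash_signature`, and `estimated_jaccard` and the choice of BLAKE2b as the hash family. LSH bucketing and the stabilization step are not shown.

```python
import hashlib


def char_ngrams(word, n=2):
    """Set of character n-grams of `word`, padded with '#' (an assumed representation)."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(seed, item):
    """One member of a seeded hash family (hypothetical choice: 64-bit BLAKE2b)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=64):
    """Minhash signature: the minimum hash value of the set under each seeded hash."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a, b = char_ngrams("hello"), char_ngrams("helo")
    print("exact:", jaccard(a, b))
    print("estimate:", estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate converges to the exact value as `num_hashes` grows; signatures are what make it possible to group candidate words by hash bucket instead of comparing every pair.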


“…Overall, the inclusion of the interactive analysis module empowers knowledge engineers with a comprehensive set of visual tools to analyze semantic relationships within the data [28], [29]. This facilitates a more informed approach to synset editing, ultimately enhancing the effectiveness of the WSD system.…”

Section: Interactive Analysis For Informed Synset Editing (mentioning)

confidence: 99%

Reversal of the Word-Sense Disambiguation Task Using Deep Learning Model

Laukaitis 2024 (Preprint)


Word Sense Disambiguation (WSD) stands as a persistent challenge within the Natural Language Processing (NLP) community. While various NLP packages exist, the Lesk algorithm in the NLTK library, a widely recognized tool, demonstrates suboptimal accuracy. Conversely, the application of deep neural networks offers heightened classification precision, yet their practical utility is constrained by demanding memory requirements. This research paper introduces an innovative method addressing WSD challenges by optimizing memory usage without compromising state-of-the-art accuracy. The presented methodology facilitates the development of a WSD system that seamlessly integrates into NLP tasks, resembling the functionality offered by the NLTK library. Furthermore, this paper advocates treating the BERT language model as a gold standard, proposing modifications to manually annotated datasets and semantic dictionaries such as WordNet to enhance WSD accuracy. The empirical validation through a series of experiments establishes the effectiveness of the proposed method, achieving state-of-the-art performance across multiple WSD datasets. This contribution represents an advancement in mitigating the challenges associated with WSD, offering a practical solution for integration into NLP applications.


“…Moreover, they showed that by replacing WordNet synsets with a small set of upper ontology concepts, it is possible to improve the accuracy of the identification of predicates. It is possible to improve the performance of word embedding using the noisy text [33] approach. This could be an interesting project for future research on word embeddings.…”

Section: Supplementary Materials (mentioning)

confidence: 99%

Deep Semantic Parsing with Upper Ontologies

2021


This paper presents a new method for semantic parsing with upper ontologies using FrameNet annotations and BERT-based sentence context distributed representations. The proposed method leverages WordNet upper ontology mapping and PropBank-style semantic role labeling and it is designed for long text parsing. Given a PropBank, FrameNet and WordNet-labeled corpus, a model is proposed that annotates the set of semantic roles with upper ontology concept names. These annotations are used for the identification of predicates and arguments that are relevant for virtual reality simulators in a 3D world with a built-in physics engine. It is shown that state-of-the-art results can be achieved in relation to semantic role labeling with upper ontology concepts. Additionally, a manually annotated corpus was created using this new method and is presented in this study. It is suggested as a benchmark for future studies relevant to semantic parsing.
