Proceedings of the Workshop on Noisy User-Generated Text 2015
DOI: 10.18653/v1/w15-4305

A Normalizer for UGC in Brazilian Portuguese

Abstract: User-generated content (UGC) is an important source of information for governments, companies, political candidates and consumers. However, most Natural Language Processing (NLP) tools and techniques are developed from and for standard-language text, and UGC is especially rich in creativity and idiosyncrasies, which act as noise for NLP purposes. This paper presents UGCNormal, a lexicon-based tool for UGC normalization. It encompasses a tokenizer, a sentence segmentation tool, a…

Cited by 3 publications
(2 citation statements)
references
References 10 publications
“…Qian et al 2015 use a transition-based model for joint segmentation, POS-tagging and normalization for the Chinese language. Duran et al 2015 propose a lexicon-based tool for user-generated content (UGC) normalization in Brazilian Portuguese. In recent years, we observe many studies focusing on social media text normalization for Spanish, which is a morphologically more interesting language (also carrying some similar challenges to Turkish such as the importance of diacritic restoration in social media text normalization).…”
Section: Related Work (mentioning)
confidence: 99%
“…Text normalization refers to the process of modifying the format of a text to communicate ideas for a particular purpose [12,13]. Text normalization is crucial to employ since a text or reading will often contain words that are challenging to grasp, ranging from vocabulary used differently in different languages or contexts to sentences that have distinct meanings from their original context [14,15].…”
Section: Introduction (mentioning)
confidence: 99%