Normalization of non-standard words

Sproat, Richard; Black, Alan W.; Chen, Stanley; Kumar, Shankar; Ostendorf, Mari; Richards, Christopher D.

doi:10.1006/csla.2001.0169

Cited by 255 publications

(183 citation statements)

References 16 publications

Supporting

Mentioning

173

Contrasting

Unclassified

“…Though our findings are not fully comparable to those in the two previous references, we can see that our blog corpus is the one that presents the lowest deviation rate - comparable to newspaper text if we take into account that Sproat et al (2001) were looking to a wider variety of non-standard forms. In contrast, our other corpora present very high rates of deviations - which are in line with the findings of both Sproat et al (2001) and Han and Baldwin (2011) in their less formal types of texts.…”

Section: Characteristics Of Spanish UGC and English UGC (contrasting)

confidence: 98%

“…In contrast, our other corpora present very high rates of deviations - which are in line with the findings of both Sproat et al (2001) and Han and Baldwin (2011) in their less formal types of texts.…”

Section: Characteristics Of Spanish UGC and English UGC (supporting)

confidence: 89%

“…While Sproat et al (2001) work with four different types of text (newspaper text, real estate ads, and servlist texts on the topics of palmtop computers and cooking recipes), most later papers deal either with SMS texts - see for instance Choudhury et al (2007), Kobus et al (2008), or Cook and Stevenson (2009) - or Twitter text - see for instance Clark and Araki (2011), Brody and Diakopoulos (2011), Foster et al (2011), Han and Baldwin (2011), Hassan and Menezes (2013) or Eisenstein (2013). Also, Liu et al (2012) have worked on both SMS and Twitter datasets.…”

Section: Text Normalization As a Task (mentioning)

confidence: 99%

Selection of correction candidates for the normalization of Spanish user-generated content

et al. 2014

We present research aiming to build tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires the revisiting of the initial steps of Natural Language Processing (NLP), since UGC (micro-blog, blog, and, generally, Web 2.0 user generated texts) presents a number of non-standard communicative and linguistic characteristics -often closer to oral and colloquial language than to edited text. We present a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews and blogs, and describe its main characteristics. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging.

“…Next, for both data sets all the words in the sentences are labeled with parts of speech (POS) and named entities (NE). Finally, to ensure the integrity of the Twitter data, English language filtering and non-standard word (NSW) normalization [14] is also performed.…”

Section: Preprocessing (mentioning)

confidence: 99%

Utilizing Human-to-Human Conversation Examples for a Multi Domain Chat-Oriented Dialog System

Nio, Sakti, Neubig, et al. 2014

IEICE Trans. Inf. & Syst.

Lasguido NIO, Nonmember, Sakriani SAKTI, Member, Graham NEUBIG, Nonmember, Tomoki TODA, and Satoshi NAKAMURA, Members

SUMMARY This paper describes the design and evaluation of a method for developing a chat-oriented dialog system by utilizing real human-to-human conversation examples from movie scripts and Twitter conversations. The aim of the proposed method is to build a conversational agent that can interact with users in as natural a fashion as possible, while reducing the time requirement for database design and collection. A number of the challenging design issues we faced are described, including (1) constructing an appropriate dialog corpora from raw movie scripts and Twitter data, and (2) developing an multi domain chat-oriented dialog management system which can retrieve a proper system response based on the current user query. To build a dialog corpus, we propose a unit of conversation called a tri-turn (a trigram conversation turn), as well as extraction and semantic similarity analysis techniques to help ensure that the content extracted from raw movie/drama script files forms appropriate dialog-pair (query-response) examples. The constructed dialog corpora are then utilized in a data-driven dialog management system. Here, various approaches are investigated including example-based (EBDM) and response generation using phrase-based statistical machine translation (SMT). In particular, we use two EBDM: syntactic-semantic similarity retrieval and TF-IDF based cosine similarity retrieval. Experiments are conducted to compare and contrast EBDM and SMT approaches in building a chat-oriented dialog system, and we investigate a combined method that addresses the advantages and disadvantages of both approaches. System performance was evaluated based on objective metrics (semantic similarity and cosine similarity) and human subjective evaluation from a small user study. Experimental results show that the proposed filtering approach effectively improve the performance. Furthermore, the results also show that by combing both EBDM and SMT approaches, we could overcome the shortcomings of each.

key words: dialog corpora, response generation, example-based dialog modeling, semantic similarity, cosine similarity, machine translation

Introduction

The continuous growth of information technology is having an increasingly large impact on many aspects of our daily lives. The issue of communication via speech between human beings and information-processing machines is also becoming more important [1]. A common dream is to realize a technology that allows humans to communicate or have dialogs with machines through natural and spontaneous speech. … Dialog systems can also be described by the amount of human intervention used in their construction, ranging from entirely hand-made to completely data-driven. Seminal work often limited interactions to a specific scenario (e.g. a Rogerian psychotherapist [4]) or were based on complex, knowledge-rich rule-based systems for generating responses, which requi...
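The abstract above mentions TF-IDF based cosine similarity retrieval for example-based dialog management. The following is a minimal sketch of that general idea only, not the authors' system: the function names, the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of strings.

    Tokenization is plain lowercase whitespace splitting (an assumption).
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF (one common variant, chosen here for illustration).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def retrieve_response(query, pairs):
    """Return the response whose stored query is most similar to `query`.

    `pairs` is a hypothetical list of (query, response) example tuples.
    Query terms unseen in the examples are ignored.
    """
    vecs, idf = tfidf_vectors([q for q, _ in pairs])
    counts = Counter(query.lower().split())
    qvec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    best = max(range(len(pairs)), key=lambda i: cosine(qvec, vecs[i]))
    return pairs[best][1]
```

For example, with the two stored pairs `("how are you today", "fine thanks")` and `("what is your name", "i am a bot")`, the query `"what is your name please"` shares terms only with the second stored query, so its response is returned.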

“…Text normalization (Sproat et al, 2001) is an important initial phase for many natural language and speech applications. The basic task of text normalization is to convert non-standard words (NSWs) - numbers, abbreviations, dates, etc.…”

Section: Introduction (mentioning)

confidence: 99%
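To make the task described in the statement above concrete, here is a toy sketch of converting non-standard words such as abbreviations and small numbers into ordinary words. It is an illustration under stated assumptions and not the method of Sproat et al (2001): the abbreviation table, the function names, and the restriction to one- and two-digit numbers are all hypothetical choices.

```python
import re

# Hypothetical lookup table; a real system would need context to choose
# between readings (e.g. "St." as "street" or "saint").
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "approx.": "approximately"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if not 0 <= n < 100:
        raise ValueError("only 0..99 is supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def normalize(text):
    """Replace known abbreviations and 1-2 digit numbers, token by token."""
    out = []
    for tok in text.split():
        low = tok.lower()
        if low in ABBREVIATIONS:
            out.append(ABBREVIATIONS[low])
        elif re.fullmatch(r"[0-9]{1,2}", tok):
            out.append(number_to_words(int(tok)))
        else:
            out.append(tok)  # leave everything else untouched
    return " ".join(out)
```

Calling `normalize("Dr. Smith lives at 42 Main St.")` yields `"doctor Smith lives at forty two Main street"`. The fixed table shows why the task is hard in practice: the same token can need different expansions depending on context, which a lookup cannot decide.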

Hippocratic Abbreviation Expansion

Roark, Sproat 2014

Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Self Cite

Incorrect normalization of text can be particularly damaging for applications like text-to-speech synthesis (TTS) or typing auto-correction, where the resulting normalization is directly presented to the user, versus feeding downstream applications. In this paper, we focus on abbreviation expansion for TTS, which requires a "do no harm", high precision approach yielding few expansion errors at the cost of leaving relatively many abbreviations unexpanded. In the context of a large-scale, real-world TTS scenario, we present methods for training classifiers to establish whether a particular expansion is apt. We achieve a large increase in correct abbreviation expansion when combined with the baseline text normalization component of the TTS system, together with a substantial reduction in incorrect expansions.
