Choosing the right tokenizer is a non-trivial task, especially in the biomedical domain, where it poses additional challenges that, if left unresolved, propagate errors through the subsequent Natural Language Processing pipeline. This paper aims to identify these problematic cases and to analyze the output that a representative set of widely used tokenizers produces on them. This work will aid the decision-making process of choosing the right tokenization strategy for a given downstream application. In addition, it will help developers create accurate tokenization tools or improve existing ones. A total of 14 problematic cases are described, each illustrated with biomedical text samples. The outputs of 12 tokenizers on these samples are provided and discussed in relation to their level of agreement.
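As a minimal illustrative sketch of the kind of disagreement the abstract refers to (the tools and the sample sentence below are assumptions for illustration, not the 12 tokenizers or the cases evaluated in the paper), the following Python snippet applies three readily available tokenization strategies to a biomedical sentence containing hyphenated entity names, a slash-separated dosage, and a statistical expression:

```python
# Illustrative sketch only (not the paper's evaluation code): three common
# tokenization strategies applied to one biomedical sentence, showing how
# tools can disagree on hyphens, slashes, parentheses, and inequalities.
from nltk.tokenize import TreebankWordTokenizer
import spacy

text = "Anti-TNF-alpha therapy (5 mg/kg) reduced IL-6 levels (p<0.05)."

# Naive baseline: split on whitespace only.
whitespace_tokens = text.split()

# Rule-based Penn Treebank tokenizer from NLTK (needs no data downloads).
treebank_tokens = TreebankWordTokenizer().tokenize(text)

# spaCy's default English tokenizer (blank pipeline, no trained model).
nlp = spacy.blank("en")
spacy_tokens = [t.text for t in nlp(text)]

for name, tokens in [("whitespace", whitespace_tokens),
                     ("treebank", treebank_tokens),
                     ("spacy", spacy_tokens)]:
    print(f"{name:>10}: {tokens}")
```

Running such a comparison makes the divergence concrete: the whitespace baseline leaves punctuation attached to tokens such as "(5", while the rule-based tokenizers differ in how they segment hyphenated names like "Anti-TNF-alpha" and expressions like "p<0.05", which is precisely the kind of variation that can propagate into downstream biomedical NLP components.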