Proceedings of the 14th Conference on Computational Linguistics - 1992
DOI: 10.3115/992424.992434

Tokenization as the initial phase in NLP

Abstract: In this paper, the authors address the significance and complexity of tokenization, the beginning step of NLP. Notions of word and token are discussed and defined from the viewpoints of lexicography and pragmatic implementation, respectively. Automatic segmentation of Chinese words is presented as an illustration of tokenization. Practical approaches to identification of compound tokens in English, such as idioms, phrasal verbs and fixed expressions, are developed.
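As a rough illustration of what identifying compound tokens can look like in practice, the sketch below merges multi-word expressions with a greedy, left-to-right longest-match lookup against a small lexicon. This is not presented as the authors' method: the lexicon entries and the names `COMPOUNDS`, `simple_tokens`, and `merge_compounds` are hypothetical and chosen only for the example.

```python
import re

# Hypothetical lexicon of multi-word expressions (illustrative only,
# not taken from the paper): an idiom, a phrasal verb, a fixed expression.
COMPOUNDS = {
    ("kick", "the", "bucket"),
    ("give", "up"),
    ("by", "and", "large"),
}
MAX_LEN = max(len(entry) for entry in COMPOUNDS)


def simple_tokens(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def merge_compounds(tokens, lexicon=COMPOUNDS, max_len=MAX_LEN):
    """Merge lexicon entries into single tokens, preferring the longest match."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word candidates.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in lexicon:
                out.append(" ".join(candidate))
                i += n
                break
        else:
            # No multi-word entry starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(merge_compounds(simple_tokens("By and large, they did not give up.")))
```

A lookup of this kind only recognizes contiguous, uninflected forms; handling variants such as "gave up" or "give it up" would need morphological or syntactic information beyond this sketch.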

Citations: cited by 285 publications (139 citation statements)
References: 4 publications
“…The task may sound simple, but has been the focus of considerable research efforts (e.g. Webster and Kit, 1992;Guo 1997;Wu, 2003).…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…Tokenization is considered the first step in Natural Language Processing (henceforth, NLP) and it is broadly defined as the segmentation of text into primary building blocks for subsequent analysis (Webster and Kit, 1992).…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…", the question focus "movie" can be used to provide supporting indicators to locate the answer in the subsequent process, by seeking for phrases containing the question focus [24]. The different additional NLP pre-processing steps that a question goes through includes stop word removal [25], tokenization [26], stemming [27] where words with less importance are removed from the question, and question expansion [28] where synonyms for some terms of the question are added to improve the information retrieval process. If the question contains any temporal signal, the question will be forwarded to the temporal inference module for further processing.…”
Section: The Question Processing Module (citation type: mentioning)
confidence: 99%