Tokenization and proper noun recognition for information retrieval

Barcala, Francisco-Mario; Vilares, Jesús; Alonso, Miguel Ángel Verdugo; Graña, Jorge; Vilares, Manuel

doi:10.1109/dexa.2002.1045906

Cited by 12 publications

(6 citation statements)

References 9 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…F. M. Barcala, in this paper, considers a set of natural language processing processes that can be used to survey huge amounts of text chats, focusing on the modernized tokenizer which sums for a number of complex linguistic phenomena, as well as for pre-tagging tasks such as proper noun recognition [12].…”

Section: Literature Surveymentioning

confidence: 99%

Fight Against COVID-19 Pandemic Using Chat-Bots

Singh

Virdi

Choudhary³

2021

Impact of AI and Data Science in Response to Coronavirus Pandemic

View full text Add to dashboard Cite

Section: Literature Surveymentioning

confidence: 99%

Fight Against COVID-19 Pandemic Using Chat-Bots

Singh

Virdi

Choudhary³

2021

Impact of AI and Data Science in Response to Coronavirus Pandemic

View full text Add to dashboard Cite

“…all refer to the same entity. Considerable research has been done to resolve each of these names to the same person [9].…”

Section: Named Entitiesmentioning

confidence: 99%

Web Searching with Multiple Correct Answers

O'Hara

Bylander

2014

Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14)

View full text Add to dashboard Cite

Most web search engines today are geared towards providing a list of relevant websites, along with snippets of text from each website that are relevant to the user's search text. Some of them may also provide specific answers to the user's question. This paper explores techniques for combining many candidate answers into a small set of answers, when it is likely that there is more than one correct answer. We describe and test new algorithms that collect and consolidate candidate answers from many different websites using paired support, reinforcing the ranking factor of those candidate answers that co-occur as pairs across multiple website domains.

show abstract

“…Hence words do not contain uppercase letters in between, it is marked as irregular by a rule. Figure 5 shows the effect of extended tokenization on state-of-the-art Straight n x n mapping doesn t fit into ... [5,7].…”

Section: Sample Tokenization Outputsmentioning

confidence: 99%

“…Straight n x n mapping does not fit into ... [5,7] . • Direct changes of tags through empirically more adequate input units • Indirect changes of tags through changes of linguistic contexts…”

Section: Tagging Optimization Using Javatokmentioning

confidence: 99%

See 1 more Smart Citation

Text preparation through extended tokenization

Hassler

Fliedl

2006

WIT Transactions on Information and Communication Technologies, Vol 37

View full text Add to dashboard Cite

Tokenization is commonly understood as the first step of any kind of natural language text preparation. The major goal of this early (pre-linguistic) task is to convert a stream of characters into a stream of processing units called tokens. Beyond the text mining community this job is taken for granted. Commonly it is seen as an already solved problem comprising the identification of word borders and punctuation marks separated by spaces and line breaks. But in our sense it should manage language related word dependencies, incorporate domain specific knowledge, and handle morphosyntactically relevant linguistic specificities. Therefore, we propose rule-based extended tokenization including all sorts of linguistic knowledge (e.g., grammar rules, dictionaries). The core features of our implementation are identification and disambiguation of all kinds of linguistic markers, detection and expansion of abbreviations, treatment of special formats, and typing of tokens including single-and multi-tokens. To improve the quality of text mining we suggest linguistically-based tokenization as a necessary step preceeding further text processing tasks.In this paper, we focus on the task of improving the quality of standard tagging.

show abstract

Tokenization and proper noun recognition for information retrieval

Abstract: Abstract

Cited by 12 publications

References 9 publications

Fight Against COVID-19 Pandemic Using Chat-Bots

Fight Against COVID-19 Pandemic Using Chat-Bots

Web Searching with Multiple Correct Answers

Text preparation through extended tokenization

Contact Info

Product

Resources

About