2016
DOI: 10.1155/2016/4248026

ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition

Abstract: Named Entity Recognition (NER) from text constitutes the first step in many text mining applications. The most important preliminary step for NER systems using machine learning approaches is tokenization where raw text is segmented into tokens. This study proposes an enhanced rule based tokenizer, ChemTok, which utilizes rules extracted mainly from the train data set. The main novelty of ChemTok is the use of the extracted rules in order to merge the tokens split in the previous steps, thus producing longer an…
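
The split-then-merge idea described in the abstract can be illustrated with a minimal sketch. It is not ChemTok's implementation: the single merge rule used here (rejoin a hyphen or comma that sits directly between two alphanumeric tokens) and the names GLUE, split_fine, merge_tokens and tokenize are assumptions made for illustration only, whereas the paper extracts its rules mainly from training data.

```python
import re

# Hypothetical glue characters; ChemTok's real rules are extracted from data.
GLUE = {"-", ","}

def split_fine(text):
    """Split into alphanumeric runs and single punctuation marks, with offsets."""
    pattern = r"[A-Za-z0-9]+|[^\sA-Za-z0-9]"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

def merge_tokens(tokens):
    """Rejoin '<alnum><glue><alnum>' sequences that have no whitespace between them."""
    merged = []
    i = 0
    while i < len(tokens):
        text, _start, end = tokens[i]
        # Absorb a glue character plus the following token while all three touch.
        while (
            i + 2 < len(tokens)
            and text[-1].isalnum()
            and tokens[i + 1][0] in GLUE
            and tokens[i + 1][1] == end
            and tokens[i + 2][1] == tokens[i + 1][2]
            and tokens[i + 2][0].isalnum()
        ):
            text += tokens[i + 1][0] + tokens[i + 2][0]
            end = tokens[i + 2][2]
            i += 2
        merged.append(text)
        i += 1
    return merged

def tokenize(text):
    """Fine-grained split followed by the rule-based merge step."""
    return merge_tokens(split_fine(text))

print(tokenize("2,4-dinitrophenol is toxic, yes."))
```

Under this toy rule the chemical name survives as one longer token, while the comma after "toxic" stays separate because whitespace follows it.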

citations
Cited by 22 publications
(16 citation statements)
references
References 18 publications
“…The effect of tokenization on NER performance has been shown in the past (Akkasi et al, 2016; Xu et al, 2018). For this reason, we evaluated our model trained on the original training data, using various custom tokenization functions, and saw the strict micro-F1 score vary from 72% to 77% in the validation set.…”
Section: Effects Of Tokenization (mentioning)
confidence: 94%
“…To investigate the effect of a general domain tokenizer, following Habibi et al (2017), we also use the OpenNLP tokenizer. To investigate whether NER performance will be affected by tokenization quality, we employ three tokenizers optimized for chemical texts including ChemTok (Akkasi et al, 2016), OSCAR4 (Jessop et al, 2011) and NBIC UMLSGeneChemTokenizer.…”
Section: Tokenizers (mentioning)
confidence: 99%
“…For example, Riaz [11] developed an Urdu rule-based NER system that designed Urdu language pattern rules, such as the honorific title for person entities, suffix rules for location entities, and so on. Akkasi et al [12] constructed chemical-specific affixes (e.g., Hyper, Anti, and Amino) to detect the beginnings of mentions and used merged rules to detect the endings of mentions. Salah and Zakaria [20] summarized Arabic rule-based NER systems for Arabic language writing patterns, including grammar rules, heuristics rules, and morphological rules.…”
Section: A Entity Discovery (mentioning)
confidence: 99%
“…The rules are domain-independent because they are generated from the structural information of question representations. But common rule-based ED methods [11], [12] rely on the characteristics of entity types (e.g., persons, organizations, and locations) to build rules. The mention generation module also integrates the extracted mentions into an ED model to alleviate the insufficiency of annotated datasets.…”
Section: Introduction (mentioning)
confidence: 99%