Findings of the Association for Computational Linguistics: ACL 2022 (2022)
DOI: 10.18653/v1/2022.findings-acl.116

Thai Nested Named Entity Recognition Corpus

Abstract: This paper presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers obtained from news articles and restaurant reviews, a total of 4,894 documents. Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. To understand the new challenges our proposed dataset brings to the field, we conduct an experimental study on (i) cutting…

Cited by 3 publications (1 citation statement)
References 11 publications
“…Moreover, the annotated entities in these corpora are limited to the most common entity types such as drugs/chemicals and diseases (Leaman et al 2009; Gurulingappa et al 2010; Van Mulligen et al 2012; Wei et al 2016); see an overview of 20+ NER biomedical datasets in a BIGBIO (BigScience Biomedical) library (Fries et al 2022) for more information. Recent work has shown an increased interest in nested entity structures in general-domain data in various languages, including English (Ringland et al 2019), Russian (Loukachevitch et al 2021), Thai (Buaphet et al 2022), and Danish (Plank et al 2020). The most widely studied corpus for nested NER in the biomedical domain is GENIA (Kim et al 2003) which consists of 2000 PubMed abstracts and 100 000 annotations divided into 47 entity types.…”
Section: Introduction (citation type: mentioning)
confidence: 99%