Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases Mining Biological Semantics - I 2005
DOI: 10.3115/1641484.1641490

Corpus design for biomedical natural language processing

Abstract: This paper classifies six publicly available biomedical corpora according to various corpus design features and characteristics. We then present usage data for the six corpora. We show that corpora that are carefully annotated with respect to structural and linguistic characteristics and that are distributed in standard formats are more widely used than corpora that are not. These findings have implications for the design of the next generation of biomedical corpora.

Cited by 64 publications (38 citation statements)
References 19 publications
“…In 2005, Cohen et al [37] already indicated the importance of GENIA corpus, presenting it as the most used in the biomedical field. Even if NER was not the primary focus of that paper, the authors explained GENIA's predominance in terms of structural and linguistic annotation, which keeps being one of the important factors for expressiveness and popularity of GENIA corpus.…”
Section: Findings Gaps and Opportunities (mentioning)
confidence: 97%
“…A controlled vocabulary term may not match precisely, complicating interpretation of the annotation. The phenomenon has been cited as a factor that affects the utility of annotated biomedical corpora (Cohen et al 2005).…”
Section: Factors Influencing Inter-annotator Agreement (mentioning)
confidence: 99%
“…Even for PPI detection, which is one of the most investigated TM problems, there are only a few standard data sets. The usefulness of these data sets is limited by their size and annotation schema [6,3,22]. In this paper we present a new method that integrates unlabelled data in order to improve performance of a classifier trained on a smaller minimally annotated data set.…”
Section: Introduction (mentioning)
confidence: 99%