Corpus design for biomedical natural language processing

Cohen, K. Bretonnel; Fox, Lynne M.; Ogren, Philip V.; Hunter, Lawrence

doi:10.3115/1641484.1641490

Cited by 64 publications

(38 citation statements)

References 19 publications

“…In 2005, Cohen et al [37] already indicated the importance of GENIA corpus, presenting it as the most used in the biomedical field. Even if NER was not the primary focus of that paper, the authors explained GENIA's predominance in terms of structural and linguistic annotation, which keeps being one of the important factors for expressiveness and popularity of GENIA corpus.…”

Section: Findings Gaps and Opportunities (mentioning)

confidence: 97%

A systematic review of named entity recognition in biomedical texts

2011

Biomedical Named Entities (NEs) are phrases or combinations of phrases that denote specific objects or groups of objects in the biomedical literature. Research on Named Entity Recognition (NER) is one of the most disseminated activities in the automatic processing of biomedical scientific articles. We analyzed articles relevant to NER in biomedical texts, in the period from 2007 to 2009, through a systematic review. The results identify the main methods in the recognition of Biomedical NEs, features and methodologies for a NER system implementation. Aside from the tendencies identified, some gaps are detected that may constitute opportunities for new studies in the area.

“…A controlled vocabulary term may not match precisely, complicating interpretation of the annotation. The phenomenon has been cited as a factor that affects the utility of annotated biomedical corpora (Cohen et al 2005).…”

Section: Factors Influencing Inter-annotator Agreement (mentioning)

confidence: 99%

Tasks, topics and relevance judging for the TREC Genomics Track: five years of experience evaluating biomedical text information retrieval systems

2008

With the help of a team of expert biologist judges, the TREC Genomics track has generated four large sets of "gold standard" test collections, comprised of over a hundred unique topics, two kinds of ad hoc retrieval tasks, and their corresponding relevance judgments. Over the years of the track, increasingly complex tasks necessitated the creation of judging tools and training guidelines to accommodate teams of part-time short-term workers from a variety of specialized biological scientific backgrounds, and to address consistency and reproducibility of the assessment process. Important lessons were learned about factors that influenced the utility of the test collections including topic design, annotations provided by judges, methods used for identifying and training judges, and providing a central moderator "meta-judge".

“…Even for PPI detection, which is one of the most investigated TM problems, there are only a few standard data sets. The usefulness of these data sets is limited by their size and annotation schema [6,3,22]. In this paper we present a new method that integrates unlabelled data in order to improve performance of a classifier trained on a smaller minimally annotated data set.…”

Section: Introduction (mentioning)

confidence: 99%

Semi-supervised Prediction of Protein Interaction Sentences Exploiting Semantically Encoded Metrics

Polajnar; Girolami

2009

Pattern Recognition in Bioinformatics

Abstract. Protein-protein interaction (PPI) identification is an integral component of many biomedical research and database curation tools. Automation of this task through classification is one of the key goals of text mining (TM). However, labelled PPI corpora required to train classifiers are generally small. In order to overcome this sparsity in the training data, we propose a novel method of integrating corpora that do not contain relevance judgements. Our approach uses a semantic language model to gather word similarity from a large unlabelled corpus. This additional information is integrated into the sentence classification process using kernel transformations and has a re-weighting effect on the training features that leads to an 8% improvement in F-score over the baseline results. Furthermore, we discover that some words which are generally considered indicative of interactions are actually neutralised by this process.
