2014
DOI: 10.1371/journal.pone.0107477

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

Abstract: Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Manual extraction of chemical and biological entities from patents by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this pro…

Citations: cited by 54 publications (50 citation statements)
References: 21 publications
“…Chemical entities were manually annotated for 40 complete patent documents and normalized to CHEBI identifiers. The fourth corpus, noted here BioS, was prepared by the BioSemantics6 research group and covers 200 full patent documents [19]. For this corpus, patents were automatically pre-annotated and then manually curated by at least one annotator group consisting of two to ten annotators.…”
Section: Methods (mentioning)
confidence: 99%
“…The annotation guidelines vary in several aspects. For instance, the IUPAC name “water” should not be annotated as a chemical in the CEMP corpora but it should be in the BioS corpus [19]. Additionally, simple chemical elements are annotated in the CEMP corpora but not in the BioS corpus.…”
Section: Methods (mentioning)
confidence: 99%
“…It divides the chemical names into several classes including ABBREVIATION, IDENTIFIER, FAMILY, FORMULA, MULTIPLE, SYSTEMATIC, TRIVIAL and NO_CLASS. CHEMDNER-patents (10) and Akhondi et al’s corpus (11) are CNR corpora of chemical patents. As previously mentioned, CHEMDNER-patents corpus was created using the same annotation platform and chemical classes as the CHEMDNER corpus with some additional rules.…”
Section: Related Work (mentioning)
confidence: 99%
“…11,12 Extracting such information from drawn structural formulas would be even more helpful and has been pursued for years, but is still highly experimental. 13,14 In effect, only an annotated chemical patent corpus 15 could provide a truly reliable data basis for automated identification of repurposing patents.…”
Section: Potential Further Developments (mentioning)
confidence: 99%