C-Norm: a neural approach to few-shot entity normalization

Ferré, Arnaud; Deléger, Louise; Bossy, Robert; Zweigenbaum, Pierre

doi:10.1186/s12859-020-03886-8

Cited by 14 publications

(6 citation statements)

References 27 publications


“…Named entity linking. Regarding the NEL evaluation, we assessed the performance of C-Norm [42] and BioSyn [45] for the prediction of trait and phenotype classes. We did not evaluate the prediction of species because the size of the NCBI taxonomy is beyond the capacity of the algorithms.…”

Section: Results (mentioning)

confidence: 99%

“…For the NEL task involving traits and phenotypes, we use the C-Norm and BioSyn algorithms. The C-Norm method [42] achieves state-of-the-art performance on the Bacteria Biotope dataset, which has good similarities to TaeC, i.e., deep ontology and complex entity terms. C-Norm represents terms in the texts using Word2vec embeddings [43], and it represents ontology classes using vectors that integrate hierarchical information from the ontology [44].…”

Section: Methods (mentioning)

confidence: 99%


TaeC: A manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature

Nédellec, Sauvion, Bossy et al. 2024

PLoS ONE


Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. A growing number of plant molecular information networks provide interlinked interoperable data to support the discovery of gene-phenotype interactions. A large body of scientific literature and observational data obtained in-field and under controlled conditions document wheat breeding experiments. The cross-referencing of this complementary information is essential. Text from databases and scientific publications has been identified early on as a relevant source of information. However, the wide variety of terms used to refer to traits and phenotype values makes it difficult to find and cross-reference the textual information, e.g. simple dictionary lookup methods miss relevant terms. Corpora with manually annotated examples are thus needed to evaluate and train textual information extraction methods. While several corpora contain annotations of human and animal phenotypes, no corpus is available for plant traits. This hinders the evaluation of text mining-based crop knowledge graphs (e.g. AgroLD, KnetMiner, WheatIS-FAIDARE) and limits the ability to train machine learning methods and improve the quality of information. The Triticum aestivum trait Corpus is a new gold standard for traits and phenotypes of wheat. It consists of 528 PubMed references that are fully annotated by trait, phenotype, and species. We address the interoperability challenge of crossing sparse assay data and publications by using the Wheat Trait and Phenotype Ontology to normalize trait mentions and the species taxonomy of the National Center for Biotechnology Information to normalize species. The paper describes the construction of the corpus. 
A study of the performance of state-of-the-art language models for both named entity recognition and linking tasks trained on the corpus shows that it is suitable for training and evaluation. This corpus is currently the most comprehensive manually annotated corpus for natural language processing studies on crop phenotype information from the literature.



“…Several strategies were developed and investigated to exploit external lexical and semantic resources to improve machine learning models. These strategies include thematic masking [1], named entity recognition by distant supervision [2], and ontology-based normalization [3]. The biological roles of MOs depend mainly on their structure.…”

Section: Value of the Data (mentioning)

confidence: 99%

MilkOligoThesaurus, a dataset of mammalian milk oligosaccharide synonyms

Rumeau, Fenaille, Girard et al. 2024

Data in Brief

Self Cite


“…INRA (Institut national de la recherche agronomique) and Cnrs (Centre national de la recherche scientifique) at University Paris Saclay proposed a two-step method to normalize multi-word terms with concepts from a domain-specific ontology. In this method, they used vector representations of terms computed with word embedding information and hierarchical information from ontology concepts [16]. Le and Mikolov presented word2vec and later introduced the doc2vec algorithm based on adjusted techniques for learning how to embed texts identical to word2vec, thus turning doc2vec into a branch of word2vec [17].…”

Section: Related Work (mentioning)

confidence: 99%

A Rule-Based Approach to Embedding Techniques for Text Document Classification

Aubaid, Mishra 2020

Applied Sciences


With the growth of online information and sudden expansion in the number of electronic documents provided on websites and in electronic libraries, there is difficulty in categorizing text documents. Therefore, a rule-based approach is a solution to this problem; the purpose of this study is to classify documents by using a rule-based. This paper deals with the rule-based approach with the embedding technique for a document to vector (doc2vec) files. An experiment was performed on two data sets Reuters-21578 and the 20 Newsgroups to classify the top ten categories of these data sets by using a document to vector rule-based (D2vecRule). Finally, this method provided us a good classification result according to the F-measures and implementation time metrics. In conclusion, it was observed that our algorithm document to vector rule-based (D2vecRule) was good when compared with other algorithms such as JRip, One R, and ZeroR applied to the same Reuters-21578 dataset.
