LSTMVoter: chemical named entity recognition using a conglomerate of sequence labeling tools

Hemati, Wahed; Mehler, Alexander

doi:10.1186/s13321-018-0327-2

Cited by 36 publications

(25 citation statements)

References 26 publications

“…There are methods aimed at NER that have been developing during the last years (Kaewphan et al, 2018; Korvigo et al, 2018; Hemati and Mehler, 2019; Hong and Lee, 2020; Huang et al, 2020; Kilicoglu et al, 2020). Most of them are based on algorithms for NER related either to chemicals or biological objects.…”

Section: Introduction (mentioning)

confidence: 99%

Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies

et al. 2020

Text analysis can help to identify named entities (NEs) of small molecules, proteins, and genes. Such data are very important for the analysis of molecular mechanisms of disease progression and the development of new strategies for the treatment of various diseases and pathological conditions. The texts of publications are a primary source of information, which makes them especially important for collecting data of the highest quality, because the information is obtained directly rather than through databases. In our study, we aimed at the development and testing of an approach to named entity recognition in the abstracts of publications. More specifically, we have developed and tested an algorithm based on conditional random fields, which recognizes NEs of (i) genes and proteins and (ii) chemicals. Careful selection of abstracts strictly related to the subject of interest makes it possible to extract the NEs strongly associated with that subject. To test the applicability of our approach, we have applied it to the extraction of (i) potential HIV inhibitors and (ii) a set of proteins and genes potentially responsible for viremic control in HIV-positive patients. The computational experiments performed provide estimates of the accuracy of recognition of chemical NEs and of proteins (genes). The precision of chemical NE recognition is over 0.91, recall is 0.86, and the F1-score (harmonic mean of precision and recall) is 0.89; the precision of recognition of protein and gene names is over 0.86, recall is 0.83, and the F1-score is above 0.85. Evaluation of the algorithm on two case studies related to HIV treatment confirms our suggestion that it is possible to extract the NEs strongly relevant to (i) HIV inhibitors and (ii) a group of patients, i.e., the group of HIV-positive individuals with an ability to maintain an undetectable HIV-1 viral load over time in the absence of antiretroviral therapy. Analysis of the results obtained provides insights into the function of proteins that can be responsible for viremic control. Our study demonstrated the applicability of the developed approach for the extraction of useful data on HIV treatment.
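
The F1-score mentioned in the abstract above is the harmonic mean of precision and recall. A minimal sketch of that calculation follows; the function name is illustrative and not taken from the cited paper. The abstract gives precision only as a lower bound ("over 0.91", "over 0.86"), so plugging in those rounded figures yields values slightly below the reported F1-scores.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted in the abstract; the exact precision values
# are not given there, so these are lower-bound inputs (an assumption).
chemical_f1 = f1_score(0.91, 0.86)
gene_protein_f1 = f1_score(0.86, 0.83)

print(f"chemicals: {chemical_f1:.3f}")
print(f"genes/proteins: {gene_protein_f1:.3f}")
```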

“…Early techniques for chemical text mining, such as dictionary-based methods (Rebholz-Schuhmann et al, 2007; Hettne et al, 2009; Akhondi et al, 2016) and grammar-based methods (Narayanaswamy et al, 2002; Liu et al, 2012; Akhondi et al, 2015), heavily rely on expert knowledge in the chemical domain. Recently, machine learning-based techniques have reported state-of-the-art effectiveness in chemical text mining (Hemati and Mehler, 2019; Zhai et al, 2019). However, such techniques require a large amount of annotated text data, which still remains limited.…”

Section: Related Work (mentioning)

confidence: 99%

ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents

Nguyen, Akhondi, et al. 2021

Front. Res. Metr. Anal.

Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition, requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions; and (2) event extraction, which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in sampling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents.

“…We used the PyMedTermino library (Lamy et al, 2015) for concept indexing. A full-text search with the Levenshtein distance algorithm (Miller et al, 2009) was applied in a first instance for concept indexing and fuzzy search with threshold using FuzzyDict implementation (Hemati and Mehler, 2019) as a second approach for concepts not found by partial matching. The FastText model uses a combination of various subcomponents to produce high-quality embeddings.…”

Section: Medical Word and Concept Embeddings (mentioning)

confidence: 99%
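
The citation statement above describes a concept lookup that falls back to Levenshtein-based fuzzy matching under a threshold. The sketch below illustrates that general idea only. It does not use PyMedTermino or the FuzzyDict implementation, and the function names, the example vocabulary, and the threshold of 2 edits are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


def fuzzy_lookup(term, vocabulary, max_distance=2):
    """Return an exact match if present, else the closest entry within max_distance."""
    term = term.lower()
    if term in vocabulary:
        return term
    best = min(vocabulary, key=lambda entry: levenshtein(term, entry), default=None)
    if best is not None and levenshtein(term, best) <= max_distance:
        return best
    return None


# Hypothetical vocabulary, for illustration only.
vocabulary = {"ibuprofen", "paracetamol", "aspirin"}
print(fuzzy_lookup("ibuprofin", vocabulary))
print(fuzzy_lookup("insulin", vocabulary))
```

A stricter threshold reduces false matches between short, similar names at the cost of missing more misspellings; a suitable value depends on the vocabulary.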

“…One of the most effective methods is Conditional Random Fields (CRF) (Lafferty et al, 2001) since CRF is one of the most reliable sequence labeling methods. Recently, deep learning-based methods have also demonstrated state-of-the-art performance for English (Hemati and Mehler, 2019; Pérez-Pérez et al, 2017; Suárez-Paniagua et al, 2019) texts by automatically learning relevant patterns from corpora, which allows language and domain independence. However, until now, to the best of our knowledge, there is only one work that addresses the generation of Spanish biomedical word embeddings (Armengol-Estapé Jordi, 2019; Soares et al, 2019).…”

Section: Introduction (mentioning)

confidence: 99%

Deep neural model with enhanced embeddings for pharmaceutical and chemical entities recognition in Spanish clinical text

Rivera, Martínez 2019

Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

In this work, we introduce a Deep Learning architecture for pharmaceutical and chemical Named Entity Recognition in Spanish clinical case texts. We propose a hybrid model approach based on two Bidirectional Long Short-Term Memory (Bi-LSTM) networks and a Conditional Random Field (CRF) network using character, word, concept and sense embeddings to deal with the extraction of semantic, syntactic and morphological features. The approach was evaluated on the Pharma-CoNER Corpus, obtaining an F-measure of 85.24% for subtask 1 and 49.36% for subtask 2. These results prove that deep learning methods with specific domain embedding representations can outperform the state-of-the-art approaches.
