Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies

Biziukova, Nadezhda; Tarasova, O.; Ivanov, Sergey; Poroikov, Vladimir

doi:10.3389/fgene.2020.618862

Cited by 4 publications

(11 citation statements)

References 59 publications


“…Most AI-based approaches initially convert text into vectors or use sparse word text representation created with preprocessing of a text corpus, and vector preparation (for instance, such approaches include word embedding preparation or the one-hot-encoding technique). It should be noted that the performance of CNER using the naïve-Bayes approach, in general, is comparable with most of earlier published methods [ 16 , 18 , 22 – 25 ], while it is slightly lower comparing to some other approaches based on the results of fivefold CV [ 19 , 45 , 46 ].…”

Section: Results (supporting)

confidence: 64%

“…Many various artificial intelligence (AI) approaches aimed at chemical and biological named entity recognition have been developed [ 15 , 18 , 21 ]. Most approaches that have been under recent development for several years are based on the usage of neural networks with different variants of long-short term memory (LSTM) architecture or conditional random fields (CRF) [ 16 , 42 ].…”

Section: Results (mentioning)

confidence: 99%

“…An obvious disadvantage of the rule-based and dictionary-based methods is a limited number of CNEs that can be recognized due to the fixed size of dictionaries or rule numbers. Machine learning or artificial intelligence approaches [ 16 , 17 ] mainly use support vector machines [ 18 ] or artificial neural networks [ 8 , 14 ] including deep learning architectures [ 19 ]. Typically, these methods can reach an accuracy of approximately 85–95% [ 8 , 14 , 17 , 18 ].…”

Section: Introduction (mentioning)

confidence: 99%

“…There is a constant need in development of approaches providing possibility of their use by many researchers in the bridging field of chemoinformatics including medicinal chemistry, computational biology, drug discovery, material science, etc. The new methods aimed at easy-to-use, accurate and fast CNER are still in demand [ 16 ].…”

Section: Introduction (mentioning)

confidence: 99%

“…Some methods are sensitive to imbalanced data [20]. Long-short term memory networks (LSTM) [12][13][14][21][22][23][24] or conditional random fields (CRF) [16,25] are efficiently applied to the task of named entity extraction. The architecture of neural networks can be modified according to the particular task of CNER [26][27][28].…”

Section: Introduction (mentioning)

confidence: 99%


Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach

et al. 2022


Motivation Application of chemical named entity recognition (CNER) algorithms allows retrieval of information from texts about chemical compound identifiers and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized sources of information. Most methods aimed at CNER are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced. Methods and results We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to the earlier developed CNER methods, our approach uses the representation of the data as a set of fragments of text (FoTs) with the subsequent preparation of a set of multi-n-grams (sequences from one to n symbols) for each FoT. Our approach may provide the recognition of novel CNEs. For the CHEMDNER corpus, sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92, based on five-fold cross-validation. We applied the developed algorithm to the extracted CNEs of potential severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the appropriate texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method.
Conclusion The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry.
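
The multi-n-gram representation and the balanced accuracy figure mentioned in this abstract can be illustrated with a minimal sketch. It assumes that "symbols" means characters and that balanced accuracy is the usual mean of sensitivity and specificity; the function names `multi_n_grams` and `balanced_accuracy` are hypothetical and are not taken from the authors' software.

```python
def multi_n_grams(fragment: str, n: int) -> list[str]:
    """Collect every character sequence of length 1..n from a text fragment.

    Assumption: a "symbol" is a single character; the abstract does not
    spell out tokenization details.
    """
    grams = []
    for size in range(1, n + 1):
        # Slide a window of the current size over the fragment.
        for start in range(len(fragment) - size + 1):
            grams.append(fragment[start:start + size])
    return grams


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Standard definition: arithmetic mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    print(multi_n_grams("NaCl", 2))
    # Values reported in the abstract for the CHEMDNER corpus.
    print(balanced_accuracy(0.95, 0.88))
```

With the reported sensitivity of 0.95 and specificity of 0.88, the mean is about 0.915, which is consistent with the reported balanced accuracy of 0.92 given that the inputs are themselves rounded.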


Computational methods in the analysis of viral-host interactions

Tarasova, Ivanov, Biziukova et al. 2023

Cheminformatics, QSAR and Machine Learning Applications for Novel Drug Development


Identification of Proteins and Genes Associated with Hedgehog Signaling Pathway Involved in Neoplasm Formation Using Text-Mining Approach

Biziukova, Ivanov, Tarasova 2024

Big Data Min. Anal.


Analysis of molecular mechanisms that lead to the development of various types of tumors is essential for biology and medicine, because it may help to find new therapeutic opportunities for cancer treatment and cure including personalized treatment approaches. One of the pathways known to be important for the development of neoplastic diseases and pathological processes is the Hedgehog signaling pathway that normally controls human embryonic development. Systematic accumulation of various types of biological data, including interactions between proteins, regulation of genes transcription, proteomics, and metabolomics experiments results, allows the application of computational analysis of these big data for identification of key molecular mechanisms of certain diseases and pathologies and promising therapeutic targets. The aim of this study is to develop a computational approach for revealing associations between human proteins and genes interacting with the Hedgehog pathway components, as well as for identifying their roles in the development of various types of tumors. We automatically collect sets of abstract texts from the NCBI PubMed bibliographic database. For recognition of the Hedgehog pathway proteins and genes and neoplastic diseases we use a dictionary-based named entity recognition approach, while for all other proteins and genes machine learning method is used. For association extraction, we develop a set of semantic rules. We complete the results of the text analysis with the gene set enrichment analysis. The identified key pathways that may influence the Hedgehog pathway and their roles in tumor development are then verified using the information in the literature.
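
The dictionary-based recognition step described in this abstract could look roughly like the sketch below. The term list is a small illustrative subset of well-known Hedgehog pathway gene symbols, not the authors' dictionary, and `find_entities` is a hypothetical helper; the machine learning recognizer and the semantic rules for association extraction are not covered here.

```python
import re

# Illustrative subset of Hedgehog pathway gene symbols (assumption: the
# authors' actual dictionary is larger and is not reproduced here).
HEDGEHOG_TERMS = {"SHH", "PTCH1", "SMO", "GLI1"}


def find_entities(text: str, terms: set[str]) -> list[tuple[str, int]]:
    """Return (term, start offset) for each whole-word dictionary match."""
    # Longest terms first, so a longer name wins over a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return [(m.group(1), m.start()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sentence = "SMO inhibition reduces GLI1 expression."
    print(find_entities(sentence, HEDGEHOG_TERMS))
```

Matching is case-sensitive and exact, which illustrates the limitation of dictionary-based methods noted in the citation statements above: names outside the fixed dictionary are not recognized.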
