Chemical identification and indexing in PubMed full-text articles using deep learning and heuristics

Almeida, Tiago; Antunes, Rui; Silva, João Figueira; Almeida, João Rafael; Matos, Sérgio

doi:10.1093/database/baac047

Cited by 8 publications

(2 citation statements)

References 77 publications


“…For diseases and chemicals, we include in the RBES category two systems which are only “partly” rule-based (stretching our definition), as they better represent the state of the art of disease/chemical-specific models. We use “TaggerOne” ( Leaman and Lu 2016 ), a semi-Markov model, for diseases, and opt for the system that won the BioCreative VII NLM-Chem track ( Almeida et al 2022 ) for chemicals (“BC7T2W”), which uses both string matching and neural embeddings. To the best of our knowledge there exists no linking approach specific for cell lines.…”

Section: Methods (mentioning)

confidence: 99%

BELB: a biomedical entity linking benchmark

Garda, Weber-Genzel, Martin et al. 2023

Bioinformatics


Motivation: Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base. It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad coverage knowledge base UMLS, leaving their performance to more specialized ones, e.g. genes or variants, understudied. Results: We therefore developed BELB, a Biomedical Entity Linking Benchmark, providing access in a unified format to 11 corpora linked to 7 knowledge bases and spanning six entity types: gene, disease, chemical, species, cell line and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora offering a standardized testbed for reproducible experiments. Using BELB we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need of further studies towards entity-agnostic models. Availability: The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp. Supplementary information: Supplementary data are available at Bioinformatics online.


“…Various statistical model-based NER algorithms have also been proposed, often as a sequence labeling problem where the tokens in a sentence are assigned most likely tags based on token features. A popular strategy is the use of conditional random fields [11] in combination with expert-selected features [12] or contextualized word embeddings from neural networks (recurrent networks [13][14][15], or transformers [16][17][18][19]).…”

Section: Introduction (mentioning)

confidence: 99%

Extracting Structured Data from Organic Synthesis Procedures Using a Fine-Tuned Large Language Model

Ai, Meng, Shi et al. 2024

Preprint


The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and due to the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we leverage the power of fine-tuned large language models (LLMs) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.


Automatic Construction of Named Entity Corpus for Adverse Drug Reaction Prediction

Dev, Sharan 2023

Advances in Intelligent Systems and Computing
