Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Gu, Yu; Tinn, Robert; Cheng, Hao; Lucas, Michael; Usuyama, Naoto; Liu, Xiaodong; Naumann, Tristan; Gao, Jianfeng; Poon, Hoifung

doi:10.1145/3458754

Cited by 860 publications (680 citation statements)

References 50 publications


“…Using PTLMs available within the HuggingFace Transformers library, we will experiment with variations of BERT models to determine which have the best performance in article classification. These will include BERT [ 23 ], BioBERT [ 45 ], BlueBERT [ 46 ], and PubMedBERT [ 47 ]. These models differ in the pretraining text domain.…”

Section: Methods (mentioning)

confidence: 99%

A Deep Learning Approach to Refine the Identification of High-Quality Clinical Research Articles From the Biomedical Literature: Protocol for Algorithm Development and Validation

Abdelkader, Navarro, Parrish et al. 2021. JMIR Res Protoc

Background: A barrier to practicing evidence-based medicine is the rapidly increasing body of biomedical literature. Use of method terms to limit the search can help reduce the burden of screening articles for clinical relevance; however, such terms are limited by their partial dependence on indexing terms and usually produce low precision, especially when high sensitivity is required. Machine learning has been applied to the identification of high-quality literature with the potential to achieve high precision without sacrificing sensitivity. The use of artificial intelligence has shown promise to improve the efficiency of identifying sound evidence.

Objective: The primary objective of this research is to derive and validate deep learning machine models using iterations of Bidirectional Encoder Representations from Transformers (BERT) to retrieve high-quality, high-relevance evidence for clinical consideration from the biomedical literature.

Methods: Using the HuggingFace Transformers library, we will experiment with variations of BERT models, including BERT, BioBERT, BlueBERT, and PubMedBERT, to determine which have the best performance in article identification based on quality criteria. Our experiments will utilize a large data set of over 150,000 PubMed citations from 2012 to 2020 that have been manually labeled based on their methodological rigor for clinical use. We will evaluate and report on the performance of the classifiers in categorizing articles based on their likelihood of meeting quality criteria. We will report fine-tuning hyperparameters for each model, as well as their performance metrics, including recall (sensitivity), specificity, precision, accuracy, F-score, the number of articles that need to be read before finding one that is positive (meets criteria), and classification probability scores.

Results: Initial model development is underway, with further development planned for early 2022. Performance testing is expected to start in February 2022. Results will be published in 2022.

Conclusions: The experiments will aim to improve the precision of retrieving high-quality articles by applying a machine learning classifier to PubMed searching.

International Registered Report Identifier (IRRID): DERR1-10.2196/29398
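The performance metrics listed in the abstract above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch, not the protocol authors' evaluation code: the names `ConfusionCounts` and `classification_metrics` are hypothetical, the F-score is assumed to be the balanced F1, and the "number needed to read" is assumed to be the reciprocal of precision.

```python
import math
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Cells of a binary confusion matrix (hypothetical helper type)."""
    tp: int  # articles meeting criteria, predicted positive
    fp: int  # articles not meeting criteria, predicted positive
    tn: int  # articles not meeting criteria, predicted negative
    fn: int  # articles meeting criteria, predicted negative


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(c: ConfusionCounts) -> dict:
    """Compute the metrics named in the abstract from confusion counts."""
    recall = _ratio(c.tp, c.tp + c.fn)          # sensitivity
    specificity = _ratio(c.tn, c.tn + c.fp)
    precision = _ratio(c.tp, c.tp + c.fp)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    # Balanced F1 written in count form: 2TP / (2TP + FP + FN).
    f1 = _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)
    # Assumption: number needed to read = 1 / precision.
    number_needed_to_read = _ratio(1.0, precision)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "number_needed_to_read": number_needed_to_read,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; not results from any study.
    for name, value in classification_metrics(ConfusionCounts(80, 20, 890, 10)).items():
        print(f"{name}: {value:.4f}")
```

The counts in the usage example are invented. With a precision of 0.8, a reader would on average need to read 1.25 retrieved articles to find one that meets the criteria.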


“…PubMedBERT is a BERT model that pre-trained on biomedical text from the scratch by Microsoft research team. The assumption is that pre-training the BERT model solely on the text would perform better than general-domain text (10). PubMedBERT outperformed all prior language models and obtained new SOTA results in a wide range of biomedical applications (10).…”

Section: A Chemical Named Entity Recognition (mentioning)

confidence: 99%

“…The assumption is that pre-training the BERT model solely on the text would perform better than general-domain text (10). PubMedBERT outperformed all prior language models and obtained new SOTA results in a wide range of biomedical applications (10). We chose to use PubMedBERT as the base model for chemical NER task.…”

Section: A Chemical Named Entity Recognition (mentioning)

confidence: 99%

“…There have been many methods developed in the past for chemical NER and deep learning methods such as BiLSTM (2), Spacy (3), and OSCAR4 (4) have substantially improved the performance compared to traditional methods. Recently, BERT and its variants (5)(6)(7)(8)(9)(10) have achieved the state-of-the-art (SOTA) performance.…”

Section: Introduction (mentioning)

confidence: 99%

“…In BioCreative VII challenge, we first experimented with several BERT-based models, including BERT large uncased (5), BioBERT (7), BlueBERT (8), SciBERT (6), ClinicalBERT (9), and PubMedBERT (10). PubMedBERT outperformed other models when evaluated using the development data.…”

Section: Introduction (mentioning)

confidence: 99%


A BERT-Based Hybrid System for Chemical Identification and Indexing in Full-Text Articles

Erdengasileng, Han et al. 2021. Preprint

Identification and indexing of chemical compounds in full-text articles are essential steps in biomedical article categorization, information extraction, and biological text mining. BioCreative Challenge was established to evaluate methods for biological text mining and information extraction. Track 2 of BioCreative VII (summer 2021) consists of two subtasks: chemical identification and chemical indexing in full-text PubMed articles. The chemical identification subtask also includes two parts: chemical named entity recognition (NER) and chemical normalization. In this paper, we present our work on developing a hybrid pipeline for chemical named entity recognition, chemical normalization, and chemical indexing in full-text PubMed articles. Specifically, we applied BERT-based methods for chemical NER and chemical indexing, and a sieve-based dictionary matching method for chemical normalization. For subtask 1, we used PubMedBERT with data augmentation on the chemical NER task. Several chemical-MeSH dictionaries including MeSH.XML, SUPP.XML, MRCONSO.RFF, and PubTator chemical annotations are used in a specific order to get the best performance on chemical normalization. We achieved an F1 score of 0.86 and 0.7668 on chemical NER and chemical normalization, respectively. For subtask 2, we formulated it as a binary prediction problem for each individual chemical compound name. We then used a BERT-based model with engineered features and achieved a strict F1 score of 0.4825 on the test set, which is substantially higher than the median F1 score (0.3971) of all the submissions.
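The abstract above describes chemical normalization as sieve-based dictionary matching, with several dictionaries consulted in a specific order. A minimal sketch of that general idea follows. It is not the authors' pipeline: the function name `normalize_mention`, the lowercase exact-match rule, the sieve names, and the toy entries with placeholder identifiers are all assumptions made for illustration, and the actual order and matching rules are not given in the abstract.

```python
from typing import Dict, List, Optional, Tuple

# A sieve is a (name, lookup table) pair; sieves are tried in list order.
Sieve = Tuple[str, Dict[str, str]]


def normalize_mention(mention: str, sieves: List[Sieve]) -> Optional[Tuple[str, str]]:
    """Return (sieve name, identifier) from the first sieve that matches.

    Matching here is a simple case-insensitive exact lookup, which is an
    assumption; a real system would likely apply richer string rules.
    """
    key = mention.strip().lower()
    for name, lookup in sieves:
        if key in lookup:
            return name, lookup[key]
    return None  # no sieve recognized the mention


# Toy dictionaries with placeholder identifiers (not real MeSH IDs).
EXAMPLE_SIEVES: List[Sieve] = [
    ("primary", {"aspirin": "ID-A"}),
    ("secondary", {"aspirin": "ID-OTHER", "paracetamol": "ID-B"}),
]

if __name__ == "__main__":
    for text in [" Aspirin ", "paracetamol", "unobtainium"]:
        print(repr(text), "->", normalize_mention(text, EXAMPLE_SIEVES))
```

Because earlier sieves take priority, "aspirin" resolves through the first dictionary even though the second one also contains it, which is why the ordering of the dictionaries affects normalization results.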


Automating Materials Exploration with a Semantic Knowledge Graph for Li‐Ion Battery Cathodes

Nie, Zheng, Liu et al. 2022. Adv Funct Materials

The recent marriage of materials science and artificial intelligence has created the need to extract and collate materials information from the tremendous backlog of academic publications. However, this is notoriously hard to achieve in sophisticated application domains, such as Li-ion battery (LIB) cathodes, which require multiple variables for materials selection, making it challenging to automatically identify the critical terms in the text. Herein, a semantics representation framework, featuring a dual-attention module that refines word embeddings through multi-source information fusion, is proposed for literature mining of LIB cathodes. The word embeddings thus produced are biased toward domain-specific knowledge and can enable the detection of deep-seated associations among materials for targeted applications. Based on this framework, we establish a semantic knowledge graph dedicated to LIB cathodes, which allows us to unravel the latent materials relationships from scientific literature and even to discover candidate materials not yet exploited as cathodes before. This work provides a long-sought path to the realization of text-mining-based knowledge management for complicated materials systems with little dependence on domain expertise.
