Iz Beltagy scite author profile

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2019) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

Gururangan, Marasović, Swayamdipta et al. 2020

1,063

446

Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multiphase adaptive pretraining offers large gains in task performance.

SciBERT: A Pretrained Language Model for Scientific Text

Beltagy, Lo, Cohan 2019

Preprint

244

341

ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

Neumann, King, Beltagy et al. 2019

509

334

Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.

Construction of the Literature Graph in Semantic Scholar

Ammar, Groeneveld, Bhagavatula et al. 2018

285

204

We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction into familiar NLP tasks (e.g., entity extraction and linking), point out research challenges due to differences from standard formulations of these tasks, and report empirical results for each task. The methods described in this paper are used to enable semantic features in www.semanticscholar.org.
