Text mining is widely used within the life sciences as an evidence stream for inferring relationships between biological entities. In most cases, conventional string matching is used to identify co-occurrences of given entities within sentences. This limits the utility of text mining results, as they tend to contain significant noise due to weak inclusion criteria. We show that, in the indicative case of protein-protein interactions (PPIs), the majority of sentences containing co-occurrences (~75%) do not describe any causal relationship. We further demonstrate the feasibility of fine-tuning a strong domain-specific language model, BioBERT, to analyse sentences containing co-occurrences and accurately (F1 score: 88.95%) identify functional links between proteins. These strong results come in spite of the deep complexity of the language involved, which limits the accuracy even of expert curators. We establish guidelines for best practices in data creation to this end, including an examination of inter-annotator agreement, of semisupervision, and of rule-based alternatives to manual curation. We also explore the potential for downstream use of the model to accelerate curation of interactions in the SIGNOR database of causal protein interactions and the IntAct database of experimental evidence for physical protein interactions.
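To make the baseline concrete, the following is a minimal sketch of sentence-level co-occurrence detection by plain string matching, the kind of weak inclusion criterion the abstract describes. It is not the authors' pipeline: the protein list, the naive sentence splitter, and the function names `split_sentences` and `find_cooccurrences` are all assumptions made for illustration.

```python
import re
from itertools import combinations

# Hypothetical entity list, for illustration only.
PROTEINS = ["TP53", "MDM2", "AKT1"]


def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_cooccurrences(text, names=PROTEINS):
    # Whole-word, case-sensitive patterns for each entity name.
    patterns = {n: re.compile(r"\b" + re.escape(n) + r"\b") for n in names}
    hits = []
    for sentence in split_sentences(text):
        present = [n for n in names if patterns[n].search(sentence)]
        # Every unordered pair of distinct entities in one sentence is a hit.
        for a, b in combinations(present, 2):
            hits.append((a, b, sentence))
    return hits


text = (
    "MDM2 ubiquitinates TP53. "
    "TP53 and AKT1 were both measured. "
    "AKT1 is a kinase."
)
for a, b, sentence in find_cooccurrences(text):
    print(a, b, "|", sentence)
```

Both of the first two sentences are returned as hits, yet only the first states a functional relationship. The second is the sort of co-occurrence without a causal link that a sentence classifier would need to filter out.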
The COVID-19 Open Research Dataset (CORD-19) was released in March 2020 to allow the machine learning and wider research community to develop techniques to answer scientific questions on COVID-19. The dataset consists of a large collection of scientific literature, including over 100,000 full-text papers. Annotating training data to normalise variability in biological entities can improve the performance of downstream analysis and interpretation. To facilitate and enhance the use of the CORD-19 data in these applications, in late March 2020 we performed a comprehensive annotation process using the named entity recognition tool TERMite, along with a number of large reference ontologies and vocabularies covering domains including genes, proteins, drugs and virus strains. The additional annotation has identified and tagged over 45 million entity mentions within the corpus, corresponding to 62,746 unique biomedical entities. The latest updated version of the annotated data, as well as older versions, is made openly available under the GPL-2.0 License for the community to use at: https://github.com/SciBiteLabs/CORD19