Large-scale extraction of gene interactions from full-text literature using DeepDive

Mallory, Emily K.; Zhang, Ce; Ré, Christopher; Altman, Russ B.

doi:10.1093/bioinformatics/btv476

Cited by 67 publications

(46 citation statements)

References 31 publications

“…Distant supervision has been used as a solution to fix the barren amount of large datasets [4]. Approaches have used this paradigm to extract chemical-gene interactions [54], disease-gene associations [30] and protein-protein interactions [30,54,60]. In fact, efforts done in [60] served as one of the motivating rationales for our work.…”

Section: Supervised Extractors (mentioning)

confidence: 99%

Expanding a Database-derived Biomedical Knowledge Graph via Multi-relation Extraction from Biomedical Abstracts

Nicholson, Himmelstein, Greene. 2019. Preprint.

Knowledge graphs support multiple research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via some form of manual curation, which is difficult to scale in the context of an increasing publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to automatically annotate textual data. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This makes populating a knowledge graph with multiple node and edge types practically infeasible. We sought to accelerate the label function creation process by evaluating the extent to which label functions could be re-used across multiple edge types. We used a subset of an existing knowledge graph centered on disease, compound, and gene entities to evaluate label function re-use. We determined the best label function combination by comparing a baseline database-only model with the same model augmented with edge-specific or edge-mismatch label functions. We confirmed that adding edge-specific rather than edge-mismatch label functions often improves text annotation, and we showed that this approach can incorporate novel edges into our source knowledge graph. We expect that continued development of this strategy has the potential to swiftly populate knowledge graphs with new discoveries, ensuring that these resources include cutting-edge results.

“…Lastly, Relation Extraction (RE) is a task for extracting pre-defined facts relating to an entity or entities in the text [29]. In biomedical domain, multiple RE methods have been developed to extract information relating to genes [16], such as Mutation-Disease associations, protein-protein interaction [30,31], pathway curation [32], gene methylation and cancer relation [33], biomolecular events [34], metabolic reactions [35] and gene-gene interactions [36]. For gene regulatory networks, which is the focus of this paper, the RE system must detect and extract a causal relation between a protein and a gene (e.g., A regulated B).…”

Section: Overview and Related Work (mentioning)

confidence: 99%

ModEx: A text mining system for extracting mode of regulation of Transcription Factor-gene regulatory interaction

Farahmand, Riley, Zarringhalam. 2019. Preprint.

Abstract. Background: Transcription factors (TFs) are proteins that are fundamental to transcription and regulation of gene expression. Each TF may regulate multiple genes and each gene may be regulated by multiple TFs. TFs can act as either activators or repressors of gene expression. This complex network of interactions between TFs and genes underlies many developmental and biological processes and is implicated in several human diseases such as cancer. Hence, deciphering the network of TF-gene interactions with information on mode of regulation (activation vs. repression) is an important step toward understanding the regulatory pathways that underlie complex traits. There are many experimental, computational, and manually curated databases of TF-gene interactions. In particular, high-throughput ChIP-Seq datasets provide a large-scale map of transcriptional regulatory interactions. However, these interactions are not annotated with information on context and mode of regulation. Such information is crucial to gain a global picture of gene regulatory mechanisms and can aid in developing machine learning models for applications such as biomarker discovery, prediction of response to therapy, and precision medicine. Methods: In this work, we introduce a text-mining system to annotate ChIP-Seq derived interactions with such metadata through mining PubMed articles. We evaluate the performance of our system using gold standard small scale manually curated databases. Results: Our results show that the method is able to accurately extract mode of regulation with an F-score of 0.77 on TRRUST curated interactions and an F-score of 0.96 on the intersection of TRRUST and ChIP-network. We provide an HTTP REST API for our code to facilitate usage. Availability: Source code and datasets are available for download on GitHub: https:

“…With the exponential growth of the literature, manual curation requires prioritization of specific drugs or genes in order to stay up to date with current research. In collaboration with Emily Mallory and Prof. Russ Altman [30] at Stanford, we are developing DeepDive applications in the field of pharmacogenomics. Specifically, we use DeepDive to extract relations between genes, diseases, and drugs in order to predict novel pharmacological relationships.…”

Section: Applications (mentioning)

confidence: 99%

Extracting Databases from Dark Data with DeepDive

Zhang, Shin, Ré, et al. 2016. Proceedings of the 2016 International Conference on Management of Data.

Self Cite

DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data (scientific papers, Web classified ads, customer service notes, and so on) were instead in a relational database, it would give analysts a massive and valuable new set of “big data.” DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.
