Automatic extraction of protein-protein interactions using grammatical relationship graph

Yu, Kaixian; Lung, Pei-Yau; Zhao, Tingting; Zhao, Pei‐Ji; Tseng, Yan-Yuan; Zhang, Jinfeng

doi:10.1186/s12911-018-0628-4

Cited by 27 publications

(17 citation statements)

References 77 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…00043 useful information from unstructured narrative text. 4,5 NLP algorithms have been widely applied to medical research in recent years for various tasks, including extracting protein-protein interactions, 6 predicting gene-disease associations from biomedical literature databases, 7 improving the sensitivity of screening for suicide behaviors among pregnant women from electronic health record systems, 8 and correlating mammographic imaging features with pathologic findings. 9 Our group previously used an NLP algorithm to automatically parse breast pathologic reports in both English and Chinese.…”

Section: Introductionmentioning

confidence: 99%

Validation of a Semiautomated Natural Language Processing–Based Procedure for Meta-Analysis of Cancer Susceptibility Gene Penetrance

Deng

Yin

Bao

et al. 2019

JCO Clinical Cancer Informatics

View full text Add to dashboard Cite

PURPOSE Quantifying the risk of cancer associated with pathogenic mutations in germline cancer susceptibility genes—that is, penetrance—enables the personalization of preventive management strategies. Conducting a meta-analysis is the best way to obtain robust risk estimates. We have previously developed a natural language processing (NLP) –based abstract classifier which classifies abstracts as relevant to penetrance, prevalence of mutations, both, or neither. In this work, we evaluate the performance of this NLP-based procedure. MATERIALS AND METHODS We compared the semiautomated NLP-based procedure, which involves automated abstract classification and text mining, followed by human review of identified studies, with the traditional procedure that requires human review of all studies. Ten high-quality gene–cancer penetrance meta-analyses spanning 16 gene–cancer associations were used as the gold standard by which to evaluate the performance of our procedure. For each meta-analysis, we evaluated the number of abstracts that required human review (workload) and the ability to identify the studies that were included by the authors in their quantitative analysis (coverage). RESULTS Compared with the traditional procedure, the semiautomated NLP-based procedure led to a lower workload across all 10 meta-analyses, with an overall 84% reduction (2,774 abstracts v 16,941 abstracts) in the amount of human review required. Overall coverage was 93%—we are able to identify 132 of 142 studies—before reviewing references of identified studies. Reasons for the 10 missed studies included blank and poorly written abstracts. After reviewing references, nine of the previously missed studies were identified and coverage improved to 99% (141 of 142 studies). CONCLUSION We demonstrated that an NLP-based procedure can significantly reduce the review workload without compromising the ability to identify relevant studies. NLP algorithms have promising potential for reducing human efforts in the literature review process.

show abstract

Section: Introductionmentioning

confidence: 99%

Validation of a Semiautomated Natural Language Processing–Based Procedure for Meta-Analysis of Cancer Susceptibility Gene Penetrance

Deng

Yin

Bao

et al. 2019

JCO Clinical Cancer Informatics

View full text Add to dashboard Cite

show abstract

“…However, we are mainly interested in the precision metric, motivated by a real-world application of the system. We compared our method to existing methods in literature, including rule-based approaches [19], [20], featureand kernel-based approaches [21]- [26], and neural network approaches [27]- [31]. Additionally, we compared our system to recent transformer-based methods pre-trained on biomedical texts, namely BioBERT [34] and SciBERT [35].…”

Section: Resultsmentioning

confidence: 99%

“…Fundel et al [19] showed how a small number of carefully designed rules based on the shortest dependency path (SDP) between two examined entities produces fairly good results. Yu et al [20] exploited dependency parse trees and a flexible pattern matching scheme, enriching the system with a decision tree classifier. Diverse syntactic and orthography features have been extensively used in feature-and kernel-based methods.…”

Section: Related Workmentioning

confidence: 99%

High-Precision Biomedical Relation Extraction for Reducing Human Curation Efforts in Industrial Applications

et al. 2020

View full text Add to dashboard Cite

The body of biomedical literature is growing at an unprecedented rate, exceeding the ability of researchers to make effective use of this knowledge-rich amount of information. This growth has created interest in biomedical relation extraction approaches to extract domain-specific knowledge for diverse applications. Despite the great progress in the techniques, the retrieved evidence still needs to undergo a time-consuming manual curation process to be truly useful. Most relation extraction systems have been conceived in the context of Shared Tasks, with the goal of maximizing the F1 score on restricted, domainspecific test sets. However, in industrial applications relations typically serve as input to a pipeline of biologically driven analyses; as a result, highly precise extractions are central for cutting down the manual curation effort, thus to translate the research evidence into practice smoothly and reliably. In this paper, we present a highly precise relation extraction system designed to reduce human curation efforts. The engine is made up of sophisticated rules that leverage linguistic aspects of the texts rather than sticking on application-specific training data. As a result, the system could be applied to diverse needs. Experiments on gold-standard corpora show that the system achieves the highest precision compared with previous rulebased, kernel-based, and neural approaches, while maintaining a F1 score comparable or superior to other methods. To show the usefulness of our approach in industrial scenarios, we finally present a case study on the mTOR pathway, showing how it could be applied on a large-scale.

show abstract

“…These sentences were then dependency parsed. A protein-protein extraction system previously developed was then applied based on features extracted during the previous steps [ 28 , 54 , 55 ]. Finally, features were extracted from preceding steps in the pipeline and were used as input for training an XGBoost classifier [ 56 ].…”

Section: Methodsmentioning

confidence: 99%

“…It was constructed based on the personal knowledge of the authors and manual review of sentences in the literature known to contain protein-protein interactions [ 21 ]. This dictionary has been successfully employed in similar applications [ 7 , 28 , 61 – 64 ].…”

Section: Methodsmentioning

confidence: 99%

Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach

Steppi

Zhong

et al. 2020

BMC Genomics

Self Cite

View full text Add to dashboard Cite

Background Information on protein-protein interactions affected by mutations is very useful for understanding the biological effect of mutations and for developing treatments targeting the interactions. In this study, we developed a natural language processing (NLP) based machine learning approach for extracting such information from literature. Our aim is to identify journal abstracts or paragraphs in full-text articles that contain at least one occurrence of a protein-protein interaction (PPI) affected by a mutation. Results Our system makes use of latest NLP methods with a large number of engineered features including some based on pre-trained word embedding. Our final model achieved satisfactory performance in the Document Triage Task of the BioCreative VI Precision Medicine Track with highest recall and comparable F1-score. Conclusions The performance of our method indicates that it is ideally suited for being combined with manual annotations. Our machine learning framework and engineered features will also be very helpful for other researchers to further improve this and other related biological text mining tasks using either traditional machine learning or deep learning based methods.

show abstract

Automatic extraction of protein-protein interactions using grammatical relationship graph

Cited by 27 publications

References 77 publications

Validation of a Semiautomated Natural Language Processing–Based Procedure for Meta-Analysis of Cancer Susceptibility Gene Penetrance

Validation of a Semiautomated Natural Language Processing–Based Procedure for Meta-Analysis of Cancer Susceptibility Gene Penetrance

High-Precision Biomedical Relation Extraction for Reducing Human Curation Efforts in Industrial Applications

Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach

Contact Info

Product

Resources

About