2019
DOI: 10.1101/545913
Preprint

Predicting Pathogenicity of Missense Variants with Weakly Supervised Regression

Abstract: Quickly growing genetic variation data of unknown clinical significance demand computational methods that can reliably predict clinical phenotypes and deeply unravel molecular mechanisms. On the platform enabled by the Critical Assessment of Genome Interpretation (CAGI), we develop a novel “weakly supervised” regression (WSR) model that not only predicts precise clinical significance (probability of pathogenicity) from inexact training annotations (class of pathogenicity) but also infers underlying molecular m…


Cited by 3 publications (4 citation statements)
References 53 publications
“…The Disease Index Matrix (P d ) is a scale that associates each variant type (i.e., pair of wild‐type and variant residues) with the probability of being related to disease. The scale was estimated through a statistical analysis of a large data set of disease‐related and neutral variations retrieved from the UniProtKB and dbSNP databases. AIBI directly predicted the probability of pathogenicity with weakly supervised linear regression, as detailed in the CAGI5 special issue (Cao et al.), because exact probabilities are not available for supervised machine learning. They used variants annotated with the class of pathogenicity in ClinVar, selected from MutPred2 15 features describing molecular impacts upon variation, and designed parabola‐shaped loss functions that penalize the predicted probability of pathogenicity according to its supposed class. Color Genomics submitted four sets of predictions with LEAP (Lai et al.), a machine learning framework that predicts variant pathogenicity from features including: population frequencies from gnomAD; function prediction from SnpEff (Cingolani et al.), SIFT (Ng & Henikoff), PolyPhen‐2 (Adzhubei, Jordan, & Sunyaev), and MutationTaster2 (Schwarz, Cooper, Schuelke, & Seelow); splice‐impact estimation from Alamut (Interactive Biosoftware, Rouen, France) and Skippy (Woolfe, Mullikin, & Elnitski); indicators from the subscription version of HGMD, namely whether the variant is included in HGMD, whether it is associated with one or more articles curated by HGMD, and whether HGMD associates it with cancer (Stenson et al.); and aggregate information from individuals who have undergone genetic testing.…”
Section: Methods
confidence: 99%
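The weakly supervised setup quoted above (class labels in place of exact probability targets, a linear model over 15 MutPred2 features plus a constant, and a class-dependent convex penalty) can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the synthetic data, the quadratic stand-in for their parabola-shaped loss, and all variable names are assumptions.

```python
import numpy as np

# Minimal sketch (assumed setup): each variant has 15 features plus a constant
# intercept column, and only a class of pathogenicity (0 = benign,
# 1 = pathogenic) is known, not an exact probability.
rng = np.random.default_rng(0)
n, d = 200, 16
X = np.hstack([rng.normal(size=(n, d - 1)), np.ones((n, 1))])
y = (X[:, 0] > 0).astype(float)  # weak labels: class only, synthetic here

def grad(p, cls):
    # Gradient of an illustrative convex penalty (p - class)^2 that pulls the
    # predicted probability toward the region implied by the class label;
    # the cited work uses a parabola-shaped polynomial instead.
    return 2.0 * (p - cls)

w = np.zeros(d)
for _ in range(500):             # plain gradient descent on the linear model
    p = X @ w                    # linear predictor of pathogenicity probability
    w -= 0.01 * (X.T @ grad(p, y)) / n

p = np.clip(X @ w, 0.0, 1.0)     # report probabilities in [0, 1]
```

Trained this way, variants labeled pathogenic should receive higher predicted probabilities on average than benign ones, which is the behavior the class-dependent penalty is designed to induce.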
“…AIBI directly predicted the probability of pathogenicity with weakly supervised linear regression, as detailed in the CAGI5 special issue (Cao et al.), because exact probabilities are not available for supervised machine learning. They used variants annotated with the class of pathogenicity in ClinVar, selected from MutPred2 15 features describing molecular impacts upon variation, and designed parabola‐shaped loss functions that penalize the predicted probability of pathogenicity according to its supposed class.…”
Section: Methods
confidence: 99%
“…In addition to the other cancer‐related challenges outlined above, there are two that required prediction of the pathogenicity of germline variants in cancer‐related proteins: one for breast cancer risk from variants in BRCA1 and BRCA2 as characterized by the ENIGMA consortium (Cao et al.; Cline et al.; Padilla et al.; Parsons et al.), and the other for cancer risk of variants in CHEK2 in Latina breast cancer cases and ancestry‐matched controls (Voskanian et al.).…”
Section: Introduction
confidence: 99%
“…In total, 2,026 variations of six tumor suppressors (CHEK2, BRCA1, BRCA2, BRIP1, RBBP8, and TP53) were collected. Using MutPred2, 15 features were extracted; together with a constant as the 16th feature, they were used in linear regression with a tailored loss function (Cao et al.). Specifically, to describe a penalty more in line with the real biological processes while reducing the complexity of the optimization, the loss function needs to be convex and first‐order differentiable. To accommodate these two conditions, a parabola‐shaped polynomial of degree six was implemented as the loss function.…”
Section: Methods
confidence: 99%
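The two requirements named in the statement above (convexity and first-order differentiability of a degree-six, parabola-shaped penalty) can be checked concretely. The polynomial below is one simple form satisfying both conditions; the actual coefficients used by the authors are not given here, so this is an illustrative assumption.

```python
import numpy as np

def loss(p, t):
    # Illustrative degree-six penalty for predicting probability p when the
    # class suggests target t (e.g., t = 1 for pathogenic, t = 0 for benign).
    # NOT the authors' exact polynomial: an assumed example of the shape.
    return (p - t) ** 6

def loss_grad(p, t):
    # First derivative exists everywhere, so gradient-based optimization works.
    return 6.0 * (p - t) ** 5

# Convexity check: the second derivative 30 (p - t)^4 is non-negative on a grid.
p = np.linspace(0.0, 1.0, 101)
curvature = 30.0 * (p - 1.0) ** 4   # second derivative for t = 1
assert np.all(curvature >= 0.0)
```

Compared with an ordinary quadratic, a sixth-power penalty is nearly flat close to the target and rises steeply far from it, so mildly imprecise predictions are penalized lightly while confident errors in the wrong class are penalized heavily.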