2022
DOI: 10.1101/2022.08.25.505311
Preprint

Genome-wide prediction of disease variants with a deep protein language model

Abstract: Distinguishing between damaging and neutral missense variants is an ongoing challenge in human genetics, with profound implications for clinical diagnosis, genetic studies and protein engineering. Recently, deep-learning models have achieved state-of-the-art performance in classifying variants as pathogenic or benign. However, these models are currently unable to provide predictions over all missense variants, either because of dependency on close protein homologs or due to software limitations. Here we levera…

Cited by 58 publications (125 citation statements). References 56 publications.
“…It consists of two different benchmarks measuring mutations made via substitutions (~1.5M missense variants across 87 DMS assays) and indels (~300K mutants across 7 DMS assays). In our experiments, we compared against all baselines already available in ProteinGym (e.g., Tranception, EVE, MSA Transformer) and added the following: Unirep [Alley et al, 2019], RITA [Hesslow et al, 2022], Progen2 [Nijkamp et al, 2022] and ESM-1b (original model from Rives et al [2021] with extensions from Brandes et al [2022]; Appendix C.1).…”
Section: Methods
confidence: 99%
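Benchmarks like ProteinGym typically rank variant-effect predictors by rank correlation between model scores and DMS measurements. As an illustration only (not the benchmark's actual implementation), a minimal Spearman correlation with tie-aware ranking can be computed with the standard library alone:

```python
def rank(values):
    """Assign ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because Spearman only uses ranks, any monotonic transformation of the model's scores leaves the benchmark result unchanged, which is why it is a common choice for comparing heterogeneous predictors.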
“…For ESM-1b, we use the official ESM repository for model checkpoint and mutation effects predictions. We further extend the codebase to handle sequences longer than the model context size (i.e., longer than 1,022 amino acids) as per the procedure described in Brandes et al [2022].…”
Section: C.1 Correlation With Deep Mutational Scanning Experiments
confidence: 99%
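Handling sequences longer than the 1,022-residue context can be sketched as a sliding-window scheme: score overlapping windows that fit the context and average per-residue scores where windows overlap. This is an illustrative sketch under stated assumptions, not the exact procedure of Brandes et al [2022]; the window size, overlap, averaging rule, and the `score_window` callable are all assumptions here:

```python
def window_spans(seq_len, window=1022, overlap=511):
    """Return (start, end) spans tiling a sequence of length seq_len
    with overlapping windows of at most `window` residues."""
    step = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, seq_len)
        spans.append((start, end))
        if end == seq_len:
            return spans
        start += step


def score_long_sequence(seq, score_window):
    """Average per-residue scores across overlapping windows.

    `score_window` is a hypothetical callable that takes a subsequence
    and returns one score per residue of that subsequence.
    """
    totals = [0.0] * len(seq)
    counts = [0] * len(seq)
    for start, end in window_spans(len(seq)):
        for offset, s in enumerate(score_window(seq[start:end])):
            totals[start + offset] += s
            counts[start + offset] += 1
    return [t / c for t, c in zip(totals, counts)]
```

The overlap matters because residues near a window edge see only one-sided context; averaging scores from windows where the same residue sits more centrally mitigates that edge effect.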
“…Evolutionary scale models are also shown to perform unsupervised prediction of mutational effects (54, 55), and have recently been used in state-of-the-art applications, for example to predict the path of viral evolution (56, 57), and the clinical significance of gene variants (58). Several large scale models are now available as open source (14, 53, 59).…”
Section: Introduction
confidence: 99%
“…We benchmarked ES score against existing methods that predict oncogenic mutations using a list of TCGA somatic mutations 8 in oncogenes. To avoid information leakage, we selected 4 other methods that are not specifically trained on cancer mutation data: EVE 12 , ESM1b 13,14 , and MutationAssessor 9 are unsupervised methods while CTAT-Population 8 and VEST 3 are supervised methods. For assessing the hotspot mutations in cancer, we are interested in performance with a low false discovery rate as the functional characterization of each potential hotspot will be expensive.…”
Section: Results
confidence: 99%
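Performance at a low false discovery rate, as emphasized above for expensive hotspot follow-up, is often summarized by precision among the top-ranked calls. A minimal sketch of that idea (the metric choice here is illustrative, not the cited paper's exact evaluation):

```python
def precision_at_k(scores, labels, k):
    """Fraction of true positives (label == 1) among the k
    highest-scoring predictions."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / k
```

High precision at small k means a short candidate list is mostly true hotspots, which is what matters when each candidate triggers costly functional characterization.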
“…However, both methods use information from recurrent cancer mutations, which could potentially bias the mechanistic interpretation of the model. Unsupervised methods [12][13][14] that do not directly rely on recurrence information might provide better biological insights on cancer hotspots and a more unbiased way to identify less common cancer risk variants.…”
Section: Introduction
confidence: 99%