2016
DOI: 10.1101/092890
Preprint

Creating a universal SNP and small indel variant caller with deep neural networks

Abstract: Next-generation sequencing (NGS) is a rapidly evolving set of technologies that can be used to determine the sequence of an individual's genome [1] by calling genetic variants present in an individual using billions of short, errorful sequence reads [2]. Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome [3,4]. Here we show that a deep convolut…

citations
Cited by 83 publications
(83 citation statements)
references
References 32 publications
“…Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome [13]. Deep Neural Network based learning algorithms provide attractive solutions for many bio-informatics [14] problems because of their ability to scale for large dataset, and their effectiveness in identification of intrusive complex features from underlying data.…”
Section: Deep Learning Based Approaches In Bio-informatics
mentioning
confidence: 99%
“…CNN based models are used exhaustively in identification of Motif in DNA sequence [18], but not much has been done for the prediction of SNVs in raw DNA reads. Only recently, Google has developed DeepVariant [19] tool which uses deep learning to predict variants in aligned and cleaned DNA reads. DeepVariant method uses CNN as universal approximator for identification of variants in NGS reads.…”
Section: Deep Learning Based Approaches In Bio-informatics
mentioning
confidence: 99%
“…A four-layer dense network considering only information at the candidate site can achieve reasonable performance [29,30]. Poplin and colleagues further converted the read pileup at a potential variable site into a 221 × 100-pixel RGB image, and then used Inception-v2 [31], a network architecture normally applied in image analysis tasks, to call mutation status [32]. Base identity, base quality, and strand information were encoded in the color channels, and no additional data were used.…”
Section: Variant Calling From DNA Sequencing
mentioning
confidence: 99%
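
The pileup-to-image encoding described in the statement above can be illustrated with a minimal sketch. It is not the published implementation: the read tuple layout, the orientation of the 221 × 100 window (221 columns of positions, 100 rows of reads), the Q40 quality cap, and every intensity value are assumptions chosen only to show how base identity, base quality, and strand could each occupy one color channel.

```python
from typing import List, Tuple

WIDTH = 221   # assumed: columns are reference positions around the candidate site
HEIGHT = 100  # assumed: rows are reads, at most 100 per image

# Hypothetical red-channel intensity for each base; unknown bases map to 0.
BASE_TO_RED = {"A": 250, "C": 180, "G": 100, "T": 30}

Pixel = Tuple[int, int, int]
# Hypothetical read record: (start column, bases, base qualities, on forward strand)
Read = Tuple[int, str, List[int], bool]


def encode_pileup(reads: List[Read]) -> List[List[Pixel]]:
    """Return a HEIGHT x WIDTH grid of (R, G, B) pixels; empty cells stay black."""
    image = [[(0, 0, 0)] * WIDTH for _ in range(HEIGHT)]
    for row, (start, bases, quals, forward) in enumerate(reads[:HEIGHT]):
        for offset, (base, qual) in enumerate(zip(bases, quals)):
            col = start + offset
            if not 0 <= col < WIDTH:
                continue  # this base falls outside the image window
            red = BASE_TO_RED.get(base, 0)      # channel 1: base identity
            green = min(qual, 40) * 255 // 40   # channel 2: base quality, capped at Q40
            blue = 255 if forward else 70       # channel 3: strand
            image[row][col] = (red, green, blue)
    return image
```

For example, `encode_pileup([(109, "ACG", [40, 20, 0], True)])` fills three pixels in the first row, starting at column 109, and leaves the rest of the grid black. A grid of this shape is the kind of input an image classifier such as the Inception network mentioned above would consume.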
“…Within the somatic variant probability model, the original Strelka method has been redesigned with a further novel innovation to account for contamination of tumor cells in the matched normal sample such that somatic recall is improved, especially for liquid tumor analysis. Consistent with the emphasis on automated sample adaption in Strelka2, the liquid tumor model is an expansion of the model's state space applied to all cases, and thus does not require prior knowledge of the normal sample contamination level. For both germline and somatic calling workflows, the variant probability model is supplemented by a final empirical variant scoring (EVS) step, motivated in part by machine learning-based variant classification approaches [9,10]. This step uses a random forest model trained on numerous features indicative of call quality to improve precision by accounting for error phenomena that are not adequately represented in the generative variant probability model.…”
mentioning
confidence: 99%