Creating a universal SNP and small indel variant caller with deep neural networks

Poplin, Ryan; Chang, Pi-Chuan; Alexander, David H.; Schwartz, Scott; Colthurst, Thomas; Ku, Alexander; Newburger, Daniel E.; Dijamco, Jojo; Nguyễn, Như Gia; Afshar, Pegah T.; Gross, Steven S.; Dorfman, Lizzie; McLean, Cory Y.; DePristo, Mark A.

doi:10.1101/092890

Cited by 83 publications

(83 citation statements)

References 32 publications

“…Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome [13]. Deep Neural Network based learning algorithms provide attractive solutions for many bio-informatics [14] problems because of their ability to scale for large dataset, and their effectiveness in identification of intrusive complex features from underlying data.…”

Section: Deep Learning Based Approaches In Bio-informatics

mentioning

confidence: 99%

“…CNN based models are used exhaustively in identification of Motif in DNA sequence [18], but not much has been done for the prediction of SNVs in raw DNA reads. Only recently, Google has developed DeepVariant [19] tool which uses deep learning to predict variants in aligned and cleaned DNA reads. DeepVariant method uses CNN as universal approximator for identification of variants in NGS reads.…”

Section: Deep Learning Based Approaches In Bio-informatics

mentioning

confidence: 99%

DAVI:Deep Learning Based Tool for Alignment and Single Nucleotide Variant identification

Gupta

Saini

2019

Preprint

Next Generation Sequencing (NGS) technologies have provided affordable ways to generate error-prone raw genetic data. Extracting variant information from billions of NGS reads is still a daunting task that involves various hand-crafted and parameterized statistical tools. Here we propose a Deep Neural Network (DNN) based alignment and SNV tool known as DAVI. DAVI consists of models for both global and local alignment and for variant calling. We have evaluated the performance of DAVI against the existing state-of-the-art tool-set and found that its accuracy and performance are comparable to existing tools used for benchmarking. We further demonstrate that while existing tools are based on data generated from a specific sequencing technology, the models proposed in DAVI are generic and can be used across different NGS technologies. Moreover, this approach is a migration from expert-driven statistical models to generic, automated, self-learning models.

“…A four-layer dense network considering only information at the candidate site can achieve reasonable performance [29,30]. Poplin and colleagues further converted the read pileup at a potential variable site into a 221 × 100-pixel RGB image, and then used Inception-v2 [31], a network architecture normally applied in image analysis tasks, to call mutation status [32]. Base identity, base quality, and strand information were encoded in the color channels, and no additional data were used.…”

Section: Variant Calling From DNA Sequencing

mentioning

confidence: 99%

Computational biology: deep learning

Jones

Alasoo

Fishman

et al. 2017

Emerging Topics in Life Sciences

Deep learning is the trendiest tool in a computational biologist's toolbox. This exciting class of methods, based on artificial neural networks, quickly became popular due to its competitive performance in prediction problems. In pioneering early work, applying simple network architectures to abundant data already provided gains over traditional counterparts in functional genomics, image analysis, and medical diagnostics. Now, ideas for constructing and training networks and even off-the-shelf models have been adapted from the rapidly developing machine learning subfield to improve performance in a range of computational biology tasks. Here, we review some of these advances in the last 2 years.
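
The pileup encoding described in the citation statement above (a 221 × 100-pixel RGB image whose colour channels carry base identity, base quality, and strand) can be illustrated with a minimal sketch. Only the image size and the three encoded properties come from the quoted text; the channel assignment, the intensity values, the `Read` record, and the `encode_pileup` helper are all assumptions made for illustration and are not DeepVariant's actual implementation.

```python
from dataclasses import dataclass

WIDTH = 221   # columns: reference positions centred on the candidate site
HEIGHT = 100  # rows: one read per row (orientation is an assumption)

# Hypothetical red-channel intensity per base; the real values are not given in the text.
BASE_INTENSITY = {"A": 250, "C": 180, "G": 110, "T": 40}


@dataclass
class Read:
    """Hypothetical minimal read record used only for this sketch."""
    start: int        # 0-based reference position of the first base
    bases: str        # aligned bases (no indels, for simplicity)
    quals: list       # Phred quality score for each base
    reverse: bool     # True if the read maps to the reverse strand


def encode_pileup(reads, site, max_qual=40):
    """Return a HEIGHT x WIDTH grid of (R, G, B) tuples centred on `site`.

    Assumed channel layout: R = base identity, G = base quality, B = strand.
    """
    image = [[(0, 0, 0)] * WIDTH for _ in range(HEIGHT)]
    left = site - WIDTH // 2  # reference position drawn in column 0
    for row, read in enumerate(reads[:HEIGHT]):
        blue = 70 if read.reverse else 240  # assumed strand intensities
        for offset, (base, qual) in enumerate(zip(read.bases, read.quals)):
            col = read.start + offset - left
            if 0 <= col < WIDTH:
                red = BASE_INTENSITY.get(base, 0)
                # Scale quality to 0..255, capping at max_qual (integer arithmetic).
                green = 255 * min(qual, max_qual) // max_qual
                image[row][col] = (red, green, blue)
    return image


# Usage: one forward-strand read overlapping a candidate site at position 1001.
reads = [Read(start=1000, bases="ACGT", quals=[40, 20, 10, 0], reverse=False)]
img = encode_pileup(reads, site=1001)
```

A grid like this would then be passed to an image classifier; the sketch stops at the encoding step because the statement gives no further detail on preprocessing.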

“…Within the somatic variant probability model, the original Strelka method has been redesigned with a further novel innovation to account for contamination of tumor cells in the matched normal sample such that somatic recall is improved, especially for liquid tumor analysis. Consistent with the emphasis on automated sample adaption in Strelka2, the liquid tumor model is an expansion of the model's state space applied to all cases, and thus does not require prior knowledge of the normal sample contamination level. For both germline and somatic calling workflows, the variant probability model is supplemented by a final empirical variant scoring (EVS) step, motivated in part by machine learning-based variant classification approaches [9,10]. This step uses a random forest model trained on numerous features indicative of call quality to improve precision by accounting for error phenomena that are not adequately represented in the generative variant probability model.…”

mentioning

confidence: 99%

“…For both germline and somatic calling workflows, the variant probability model is supplemented by a final empirical variant scoring (EVS) step, motivated in part by machine learning-based variant classification approaches [9,10]. This step uses a random forest model trained on numerous features indicative of call quality to improve precision by accounting for error phenomena that are not adequately represented in the generative variant probability model.…”

mentioning

confidence: 99%

Strelka2: Fast and accurate variant calling for clinical sequencing applications

Scheffler

et al. 2017

Preprint

We describe Strelka2 (https://github.com/Illumina/strelka), an open-source small variant calling method for clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model based estimation of indel error parameters from each sample, an efficient tiered haplotype modeling strategy and a normal sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperforms current leading tools on both variant calling accuracy and compute cost.

Whole-genome sequencing is rapidly transitioning into a tool for clinical research and diagnosis, a shift which brings new challenges for sequence analysis methods. While there has been considerable progress in developing methods to improve germline and somatic small variant calling accuracy in research applications [1-6], such methods can be further improved in many respects for the clinical whole-genome sequencing scenario. These improvements include reducing the compute cost/turn-around time of whole-genome analysis, further increasing indel calling accuracy, automating parameter tuning without expert user intervention, and reducing multiple indicators of call quality to a single confidence score for variant prioritization. Here we describe Strelka2, a variant calling method building upon the innovative Strelka somatic variant caller [7], to improve upon these aspects of variant calling for both germline and somatic analysis. We demonstrate that Strelka2 is both more accurate and substantially faster when compared to current best-in-class small variant calling methods.

Strelka2 germline and somatic analyses share a common series of high-level stages, including parameter estimation from sample data, candidate variant discovery, realignment, variant probability inference, and empirical re-scoring/filtration. The composition of these steps is described in more detail for each type of analysis in Supplementary Fig. 1.

Strelka2's germline analysis introduces a novel step to adaptively estimate indel error rates from preliminary allele counts in each sample, using a mixture model to estimate both indel variant mutation rates and indel noise rates from a set of error processes (Supplementary Fig. 2). This mixture approach mitigates the impact of context-specific indel error rate variation on variant call accuracy and obviates the need to specify a prior set of common population variants.

Similar to previous work [2,3,5,6], Strelka2's germline analysis models haplotypes to provide read-backed variant phasing and reduce the impact of sequencing noise, incorrect read mapping and inconsistent alignment. Strelka2's haplotype model uses an efficient tiered scheme for haplotype discovery, combining the advantages of a simple model based on ...
