2010
DOI: 10.1101/gr.107524.110

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS (the 1000 Genome pilot alone includes nearly five terabases) make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by…

Cited by 23,067 publications (19,387 citation statements)
References 27 publications
“…Initially, the gatk (McKenna et al., 2010) variant calling pipeline identified 23,185 indels and 727,350 SNPs in the data. After retaining only SNPs and implementing a set of variant filters in gatk, we retained 682,118 high‐quality SNPs, of which 11,583 were multiallelic across the Dactylorhiza–Orchis group.…”
Section: Results (mentioning)
confidence: 99%
“…The mapping has been performed with the second‐pass approach (Engström et al., 2013) of star v2.4.1d (Dobin et al., 2013), by lowering the maximum allowed ratio of mismatches to read length to 0.11. The best practices recommendations (DePristo et al., 2011; Van Auwera et al., 2013) for gatk version 3 (McKenna et al., 2010) have been followed, but with a hybrid approach between RNA and DNA sequencing as the analyses have been performed on a reference transcriptome, not a full genome. After processing the BAM files by adding read groups and removing duplicates with picard tools (v.1.119, http://broadinstitute.github.io/picard/), we split the reads into exon segments and reassigned star mapping qualities.…”
Section: Methods (mentioning)
confidence: 99%
“…Genomic DNA was captured using SureSelectXT Human All Exon V5 (50 Mb) or V6 (60 Mb) Kits (Agilent Technologies) and sequenced on a HiSeq2500 (Illumina) with 126‐base pair paired‐end reads. The reads were mapped to the hg19 human reference using Burrows‐Wheeler Aligner (BWA) 0.6.2‐r126 [15] and single‐nucleotide variants (SNVs), and insertions and/or deletions (indels) were called using the Genome Analysis Toolkit (GATK) v. 1.6–13 [16]. After quality filtering steps, variants were annotated using ANNOVAR [17].…”
Section: Methods (mentioning)
confidence: 99%
“…Genotype‐calling software programs use either maximum‐likelihood (e.g., Stacks; Catchen et al., 2011) or Bayesian models (e.g., GATK; McKenna et al., 2010; dePristo et al., 2011; Van der Auwera et al., 2013) to assign individuals with genotypes. These models often incorporate some element of sequencing error, but the primary determinant of whether individuals are accurately genotyped as heterozygous or homozygous is the number of reads assigned to each individual.…”
Section: Design and Implement: Assessment (mentioning)
confidence: 99%