Purpose: To provide a validated method for confidently identifying exon-containing copy number variants (CNVs), with a low false discovery rate (FDR), in targeted sequencing data from a clinical laboratory, with a particular focus on single-exon CNVs.
Methods: DNA sequence coverage data are normalized within each sample, and exonic CNVs are then identified within a batch of samples (midpool) when the log2 ratio of the sample to the batch median at a target exceeds defined thresholds. The quality of exonic CNV calls is assessed by C-scores (Z-like scores) using thresholds derived from gold standard samples and simulation studies. We integrate an ExonQC threshold to lower the FDR and compare performance with alternative software (VisCap).
Results: Thirteen CNVs were used as a truth set to validate Atlas-CNV and to compare it with VisCap. We demonstrated FDR reduction in validation, simulation, and 10,926 eMERGESeq 5 samples without loss of sensitivity. Sixty-four multi-exon and 29 single-exon CNVs with high C-scores were assessed by MLPA.
Conclusions: Atlas-CNV is validated as a method to identify exonic CNVs in targeted sequencing data generated in the clinical laboratory. The ExonQC and C-score assignment can reduce FDR (identification of targets with high variance) and improve calling accuracy of single-exon CNVs, respectively. We propose guidelines and criteria to identify high-confidence single-exon CNVs.

Atlas-CNV is available for public download at http://github.com/theodorc/atlas-cnv, with the initial version (0.2) written in Perl (5.12.2) and R (3.1.1). Three inputs are required: (1) GATK DoC interval summary files, (2) a panel design containing target exons, and (3) a sample file with gender and/or midpool groupings.
Clinical Sequencing
Our clinical pipeline routinely processes about 45 samples in each midpool capture experiment. Briefly, sample DNA is isolated, sheared, ligated to barcode adapters for multiplexing, incubated with capture probes, and sequenced on Illumina HiSeq 2500 instruments with two midpools loaded on a single flow-cell lane. Paired-end reads are aligned to the hg19 reference using bwa-0.6.2 [17], with GATK-2.5.2 [18] used for realignment, recalibration, and depth of coverage (DoC) calculations.
RPKM Normalization and Sample Quality
The read depth (RD) data are normalized at the individual sample level. GATK-DoC output is used to obtain the average RD per target, and these values are normalized as a fraction of the sample coverage with RPKM normalization, as illustrated in Figure 1A. Essentially, this step converts the average RD per target to the equivalent number of reads (100 bp/read) and reports it as a proportion of the total number of mapped reads in the sample, per million. At each exon, the median sample is selected as the reference after removing the 5% outliers.
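
The normalization and reference selection described above can be sketched as follows. This is a minimal illustration in Python, not the Atlas-CNV implementation (which is written in Perl and R). It assumes the standard RPKM definition (reads per kilobase of target per million mapped reads), and the function names `rpkm` and `log2_ratios` are hypothetical. The 5% outlier removal is omitted because the text does not specify how outliers are defined, so the plain per-target median stands in for the reference.

```python
import math
from statistics import median

READ_LENGTH_BP = 100  # read length stated in the text (100 bp/read)


def rpkm(avg_depth, target_length_bp, total_mapped_reads):
    """RPKM for one target, computed from its average read depth.

    Assumes the standard RPKM definition; the exact formula in
    Figure 1A may differ in detail.
    """
    # Convert average depth to the equivalent number of reads on the target.
    reads = avg_depth * target_length_bp / READ_LENGTH_BP
    # Scale to reads per kilobase of target, per million mapped reads.
    per_kilobase = reads * 1_000 / target_length_bp
    return per_kilobase * 1_000_000 / total_mapped_reads


def log2_ratios(sample_rpkms):
    """log2(sample / reference) for one target across a midpool.

    sample_rpkms maps sample name -> RPKM at this target. The reference
    is the median RPKM across samples (outlier removal omitted here).
    Assumes every RPKM value is positive.
    """
    reference = median(sample_rpkms.values())
    return {name: math.log2(value / reference)
            for name, value in sample_rpkms.items()}


if __name__ == "__main__":
    # Hypothetical values for one target in a three-sample batch.
    ratios = log2_ratios({"s1": 10.0, "s2": 20.0, "s3": 40.0})
    for name, ratio in ratios.items():
        print(name, ratio)
```

Under this sketch, a heterozygous deletion is expected near log2(1/2) = -1 and a single-copy duplication near log2(3/2) ≈ 0.58, which is why the log2 ratio is compared against thresholds on either side of zero.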