Current methods for evaluating the accuracy of germline variant calls are restricted to easy-to-detect high-confidence regions, thus ignoring a substantial portion of difficult variants beyond the benchmark regions. We established four DNA reference materials from immortalized cell lines derived from a Chinese Quartet family consisting of two parents and monozygotic twins. We integrated benchmark calls of 4.2 million small variants and 15,000 structural variants from multiple platforms and bioinformatic pipelines for evaluating the reliability of germline variant calls inside the benchmark regions. The genetic built-in-truth of the Quartet family design not only improved sensitivity of benchmark calls by removing additional false positive variants with apparently high quality, but also enabled estimation of the precision of variant calls outside the benchmark regions. Batch effects of variant calling in large-scale DNA sequencing efforts can be effectively identified by sequencing the Quartet DNA reference materials concurrently with study samples, and can be alleviated by training a machine learning model on the Quartet reference datasets to remove potential artifact calls. Matched RNA and protein reference materials were also established in the Quartet project, so that benchmark calls constructed from the DNA reference materials can be used to evaluate variant calling performance on RNA and protein data. The Quartet DNA reference materials from this study are a resource for objective and comprehensive assessment of the accuracy of germline variant calls across the whole genome.
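The family-based "built-in truth" mentioned in this abstract can be illustrated with a small check. The following is only a minimal sketch under stated assumptions, not the pipeline used in the study: genotypes are assumed to be diploid, autosomal, and encoded as pairs of allele indices (0 for reference, 1 for alternate), and the function names are hypothetical. A site is treated as consistent when the monozygotic twins share the same genotype and that genotype can be formed from one paternal and one maternal allele.

```python
from itertools import product


def normalize(genotype):
    """Return an unordered diploid genotype as a sorted tuple, e.g. (1, 0) -> (0, 1)."""
    return tuple(sorted(genotype))


def is_mendelian_consistent(father, mother, child):
    """True if the child genotype can be built from one allele of each parent."""
    target = normalize(child)
    return any(normalize((f, m)) == target for f, m in product(father, mother))


def is_quartet_consistent(father, mother, twin1, twin2):
    """Hypothetical Quartet check (assumption: autosomal, diploid sites only).

    Monozygotic twins are expected to carry identical genotypes, and that
    shared genotype must follow Mendelian inheritance from the parents.
    """
    if normalize(twin1) != normalize(twin2):
        return False
    return is_mendelian_consistent(father, mother, twin1)


if __name__ == "__main__":
    # Father heterozygous, mother homozygous reference, both twins heterozygous.
    print(is_quartet_consistent((0, 1), (0, 0), (0, 1), (1, 0)))
    # Twins homozygous alternate cannot arise from these parents.
    print(is_quartet_consistent((0, 1), (0, 0), (1, 1), (1, 1)))
```

Calls that fail such a check at a given site are candidates for artifacts, which is the intuition behind using family structure to estimate precision where no benchmark call exists. De novo mutations and sex chromosomes would need separate handling that this sketch leaves out.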
Background: Structural variants (SVs) play a crucial role in gene regulation, trait association, and disease in humans. SV genotyping has been extensively applied in genomics research and clinical diagnosis. Although a growing number of SV genotyping methods for long reads have been developed, a comprehensive performance assessment of these methods has yet to be done.

Results: Based on one simulated and three real SV datasets, we performed an in-depth evaluation of five SV genotyping methods: cuteSV, LRcaller, Sniffles, SVJedi, and VaPoR. The results show that for insertions and deletions, cuteSV and LRcaller have similar F1 scores (cuteSV, insertions: 0.69–0.90, deletions: 0.77–0.90; LRcaller, insertions: 0.67–0.87, deletions: 0.74–0.91) and are superior to the other methods. For duplications, inversions, and translocations, LRcaller yields the most accurate genotyping results (F1 scores of 0.84, 0.68, and 0.47, respectively). When genotyping SVs located in tandem repeat regions or with imprecise breakpoints, cuteSV (insertions and deletions) and LRcaller (duplications, inversions, and translocations) are better than the other methods. In addition, we observed a decrease in F1 scores as SV size increased. Finally, our analyses suggest that the F1 scores of these methods reach the point of diminishing returns at 20× depth of coverage.

Conclusions: We present an in-depth benchmark study of long-read SV genotyping methods. Our results highlight the advantages and disadvantages of each genotyping method, which provide practical guidance for optimal application selection and prospective directions for tool improvement.
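For readers unfamiliar with the metric, the F1 scores reported above are the harmonic mean of precision and recall. The sketch below shows the standard formula only; how the study counts a true positive for a genotyped SV (for example, whether the genotype must match exactly) is not stated in this abstract, so the counts and the function name here are illustrative assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    tp, fp, fn are counts of true positives, false positives, and false
    negatives. Empty denominators yield 0.0 instead of raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative counts only, not taken from the study.
    p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which makes it a reasonable single number for comparing genotypers across SV types.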