Background: Analyzing next-generation sequencing data is difficult because datasets are large, second-generation sequencing platforms have high error rates, and each position in the target genome (exome, transcriptome, etc.) is sequenced multiple times. Given these challenges, numerous bioinformatic algorithms have been developed to analyze these data. These algorithms aim to find an appropriate balance between data loss, errors, analysis time, and memory footprint. Typical analysis pipelines require multiple steps. If one or more of these steps is unnecessary, removing it would significantly decrease compute time and data manipulation. One step in many pipelines is PCR duplicate removal, where PCR duplicates arise from multiple PCR products from the same template molecule binding on the flowcell. These are often removed because there is concern they can lead to false positive variant calls. Picard (MarkDuplicates) and SAMTools (rmdup) are the two main software tools used for PCR duplicate removal. Results: Approximately 92% of the 17+ million variants called were called whether we removed duplicates with Picard or SAMTools, or left the PCR duplicates in the dataset. There were no significant differences between the unique variant sets when comparing the transition/transversion ratios (p = 1.0), percentage of novel variants (p = 0.99), average population frequencies (p = 0.99), and the percentage of protein-changing variants (p = 1.0). Results were similar for variants in the American College of Medical Genetics genes. Genotype concordance between NGS and SNP chips was above 99% for all genotype groups (e.g., homozygous reference). Conclusions: Our results suggest that PCR duplicate removal has minimal effect on the accuracy of subsequent variant calls.
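The two comparisons named in the abstract above, the share of variants called under every duplicate-handling strategy and the transition/transversion ratio of a call set, can be illustrated with a minimal Python sketch. The variant tuples `(chrom, pos, ref, alt)`, the function names, and the toy call sets are all assumptions made for illustration; they are not the study's data or pipeline.

```python
# Illustrative sketch only: variants are hypothetical (chrom, pos, ref, alt) tuples.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(variants):
    """Transition/transversion ratio over single-nucleotide variants."""
    ts = tv = 0
    for _chrom, _pos, ref, alt in variants:
        # Skip indels, non-ACGT alleles, and records where ref equals alt.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("nan")


def shared_fraction(*call_sets):
    """Fraction of all distinct variants that appear in every call set."""
    union = set().union(*call_sets)
    shared = set.intersection(*(set(c) for c in call_sets))
    return len(shared) / len(union) if union else 0.0


# Toy call sets standing in for "no removal", Picard, and SAMTools runs.
no_removal = {("1", 100, "A", "G"), ("1", 200, "C", "A"), ("2", 50, "C", "T")}
picard = {("1", 100, "A", "G"), ("1", 200, "C", "A")}
samtools = {("1", 100, "A", "G"), ("1", 200, "C", "A"), ("2", 75, "G", "T")}

overlap = shared_fraction(no_removal, picard, samtools)
ratio = ts_tv_ratio(no_removal)
```

In this toy example two of the four distinct variants are shared by all three sets; the abstract reports the corresponding real-data figure as approximately 92%.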
Despite expanding research on the popular recreational fishery, bonefish taxonomy remains murky. The genus Albula, comprising these iconic circumtropical marine sportfishes, has a complex taxonomic history driven by highly conserved morphology. Presently, 12 putative species are spread among 3 species complexes. The cryptic morphology hinders visual identification, requiring genetic species identification in some cases. Unclear nomenclature can have unintended consequences, including exacerbating taxonomic uncertainty and complicating resolution efforts. Further, ignoring this reality in publications may erode management and conservation efforts. In the Indian and Pacific oceans, ranges and areas of overlap are unclear, precluding certainty about which species support the fishery and hindering conservation efforts. Species overlap, at both broad and localized spatial scales, may mask population declines if one is targeted primarily (as demonstrated in the western Atlantic fishery). Additional work is necessary, especially to increase our understanding of spatiotemporal ecology across life history stages and taxa. If combined with increased capacity to discern between cryptic species, population structure may be ascertained, and fisheries stakeholders will be enabled to make informed decisions. To assist in such efforts, we have constructed new range maps for each species and species complex. For bonefishes, conservation genomic approaches may resolve lingering taxonomic uncertainties, supporting effective conservation and management efforts. These methods apply broadly to taxonomic groups with cryptic diversity, aiding species delimitation and taxonomic revisions.
Supplementary data are available at Bioinformatics online.
Motivation: One of the main challenges with bioinformatics software is that the size and complexity of datasets necessitate trading speed for accuracy or completeness. To combat this problem of computational complexity, a plethora of heuristic algorithms have arisen that report a ‘good enough’ solution to biological questions. However, in instances such as Simple Sequence Repeats (SSRs), a ‘good enough’ solution may not accurately portray results in population genetics, phylogenetics and forensics, which require accurate SSRs to calculate intra- and inter-species interactions. Results: We present Kmer-SSR, which finds all SSRs faster than most heuristic SSR identification algorithms in a parallelized, easy-to-use manner. The exhaustive Kmer-SSR option has 100% precision and 100% recall and accurately identifies every SSR of any specified length. To identify more biologically pertinent SSRs, we also developed several filters that allow users to easily view a subset of SSRs based on user input. Kmer-SSR, coupled with the filter options, accurately and intuitively identifies SSRs quickly and in a more user-friendly manner than any other SSR identification algorithm. Availability and implementation: The source code is freely available on GitHub at https://github.com/ridgelab/Kmer-SSR.
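To make concrete what an SSR search reports, the following is a deliberately naive Python scan for tandem repeats. It is not Kmer-SSR's algorithm, which the abstract does not describe; the function names, parameters (`min_period`, `max_period`, `min_repeats`), and the greedy left-to-right reporting rule are assumptions chosen for illustration.

```python
def is_primitive(motif):
    """A motif is primitive if it is not a shorter unit repeated (e.g. "ATAT" is not)."""
    n = len(motif)
    return not any(
        n % d == 0 and motif[:d] * (n // d) == motif for d in range(1, n)
    )


def find_ssrs(seq, min_period=1, max_period=6, min_repeats=3):
    """Greedy left-to-right scan; returns (start, motif, repeat_count) tuples."""
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        best = None
        for period in range(min_period, max_period + 1):
            motif = seq[i:i + period]
            if len(motif) < period or not is_primitive(motif):
                continue
            # Count how many contiguous copies of the motif start at position i.
            repeats = 1
            while seq[i + repeats * period:i + (repeats + 1) * period] == motif:
                repeats += 1
            if repeats >= min_repeats:
                # Keep the candidate covering the most bases at this start.
                if best is None or repeats * period > best[2] * len(best[1]):
                    best = (i, motif, repeats)
        if best:
            hits.append(best)
            i += best[2] * len(best[1])  # jump past the reported repeat
        else:
            i += 1
    return hits


example_hits = find_ssrs("GGACACACACTTT", max_period=3)
```

Because the scan skips past each reported repeat, it does not list overlapping or shifted repeats; an exhaustive tool in the sense of the abstract would need a different reporting strategy.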
Summary: Simple Sequence Repeats (SSRs) are used to address a variety of research questions in a variety of fields (e.g. population genetics, phylogenetics, forensics), due to their high mutability within and between species. Here, we present an innovative algorithm, SA-SSR, based on suffix and longest common prefix arrays for efficiently detecting SSRs in large sets of sequences. Existing SSR detection applications are hampered by one or more limitations (e.g. speed, accuracy, ease-of-use). Our algorithm addresses these challenges while being the most comprehensive and correct SSR detection software available. SA-SSR is 100% accurate and detected >1000 more SSRs than the second best algorithm, while offering greater control to the user than any existing software. Availability and implementation: SA-SSR is freely available at http://github.com/ridgelab/SA-SSR. Contact: perry.ridge@byu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
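The two data structures named in the summary above can be sketched in a few lines of Python. This is a generic textbook construction (a naive sort-based suffix array plus Kasai's algorithm for the longest common prefix array), not SA-SSR's implementation; the helper `tandem_copies` is a hypothetical illustration of how a shared prefix between suffixes `i` and `i + period` indicates a tandem repeat.

```python
def suffix_array(s):
    """Naive construction: sort suffix start positions by suffix text."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """Kasai's algorithm: lcp[r] is the LCP of suffixes sa[r-1] and sa[r]; lcp[0] = 0."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def tandem_copies(s, i, period):
    """Hypothetical helper: copies of s[i:i+period] repeated in tandem from position i."""
    if i + period >= len(s):
        return 1
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    rank = {pos: r for r, pos in enumerate(sa)}
    lo, hi = sorted((rank[i], rank[i + period]))
    # The LCP of two suffixes is the minimum of the LCP array between their ranks.
    shared = min(lcp[lo + 1:hi + 1])
    return shared // period + 1


sa_banana = suffix_array("banana")
lcp_banana = lcp_array("banana", sa_banana)
copies = tandem_copies("GACACACT", 1, 2)
```

The sort-based construction is quadratic or worse in the worst case and is used here only for readability; production tools typically rely on linear or near-linear suffix array construction.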