Background: Next Generation Sequencing (NGS) is the foundation of many studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. To methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects, each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform's impact.
Results: The number of detected variants and variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in absolute indel numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups.
Conclusion: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
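The notions of concordance and of variants "uniquely called by a single setup" can be illustrated with a small sketch. It is not the study's pipeline: the Jaccard index is one common choice of concordance measure and is an assumption here, as are the setup names and the toy variant records, which are hypothetical.

```python
from itertools import combinations


def jaccard_concordance(calls_a, calls_b):
    """Fraction of variants called by both setups among those called by either."""
    union = calls_a | calls_b
    if not union:
        return 1.0  # two empty call sets agree trivially
    return len(calls_a & calls_b) / len(union)


def unique_calls(call_sets):
    """For each setup, return the variants that no other setup called."""
    result = {}
    for name, calls in call_sets.items():
        others = set().union(*(c for n, c in call_sets.items() if n != name))
        result[name] = calls - others
    return result


# Hypothetical toy data: each variant is (chromosome, position, ref, alt).
call_sets = {
    "hiseq_x": {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "GA")},
    "hiseq": {("1", 100, "A", "G"), ("1", 200, "C", "T")},
    "complete_genomics": {("1", 100, "A", "G"), ("2", 75, "T", "C")},
}

# Pairwise concordance between every pair of setups.
pairwise = {
    (a, b): jaccard_concordance(call_sets[a], call_sets[b])
    for a, b in combinations(call_sets, 2)
}
unique = unique_calls(call_sets)

for pair, value in pairwise.items():
    print(pair, round(value, 3))
for name, calls in unique.items():
    print(name, "unique calls:", sorted(calls))
```

In this toy data the only indel, `("2", 50, "G", "GA")`, is called by a single setup, which mirrors the kind of discrepancy the abstract describes for indels.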
Genome-wide association studies (GWAS) have become a powerful method to link point mutations, such as single nucleotide polymorphisms (SNPs), to a phenotype or disease. However, their power to detect SNPs associated with polygenic diseases such as Alzheimer's disease (AD) is limited, since they only infer the pairwise relation of single SNPs to the phenotype and ignore possible effects of SNP combinations. The common method to probe these complex genetic patterns is to compute a measure called linkage disequilibrium (LD). Although several predictive patterns found with LD have been applied successfully to medical diagnosis, the measure has several drawbacks, for example the difficulty of confirming and replicating experimental results and its sensitivity to statistical biases. Here, we present the application of an alternative method, Linkage Probability (LP), for genetic pattern identification. LP provides the posterior probability of a relation between two categorical data sets and simultaneously considers potential biases from latent variables, such as the recombination rate or the genetic structure of a population. By applying the LP framework to data from the ADSP project, we show that changes in linkage patterns between SNPs can be associated with Alzheimer's disease. Common genomic relation measures still fail to extract this link.
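For readers unfamiliar with the baseline measure, the following sketch computes the standard LD coefficient D and the squared correlation r² for two biallelic loci. It illustrates conventional LD only, not the Linkage Probability method, whose details the abstract does not give. The function name and the toy haplotypes are hypothetical.

```python
import math


def linkage_disequilibrium(haplotypes):
    """Return (D, r^2) for two biallelic loci.

    haplotypes: list of (a, b) tuples, where a and b are 0/1 alleles
    observed on the same chromosome at locus A and locus B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else math.nan  # undefined for monomorphic loci
    return d, r2


# Hypothetical toy data: alleles always co-occur (complete LD).
linked = [(1, 1)] * 4 + [(0, 0)] * 4
# Hypothetical toy data: all four haplotypes equally frequent (no LD).
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]

d_linked, r2_linked = linkage_disequilibrium(linked)
d_indep, r2_indep = linkage_disequilibrium(independent)
print(d_linked, r2_linked)
print(d_indep, r2_indep)
```

Both D and r² depend only on pairwise allele frequencies in the sample, which is why they carry no information about latent variables such as population structure.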
Soils are inhabited by communities of tiny invertebrates that participate in the essential functions of soils. Characterizing those communities in terms of species diversity and species abundance is part of investigating soil functions and their response to perturbation. Dozens to hundreds of specimens can be extracted from a single sample, and all of them need to be sorted, counted and identified. This takes an enormous amount of time and strains the workflow of soil zoologists. Deep learning-based computer vision approaches have become increasingly popular for monitoring biodiversity, as they can be applied to detect, count and classify organisms. In this work, we present CollembolAI, an open-source prototype for a computer vision workflow. It includes a hardware system for acquiring high-resolution pictures of soil species samples in fluid and a deep learning-based application (Faster R-CNN with Slicing Aided Hyper Inference) to train and evaluate models for the detection and classification of animals in those pictures. We evaluated the workflow using a mix of specimens belonging to 12 species of springtail and mite, picked from our taxonomic collection. Specimens were photographed multiple times from various viewing angles. The model was trained using 5671 views of specimens on 30 images and tested on 442 views of new specimens on six images. CollembolAI is affordable, simple to build and allows the rapid digitization of mesofauna samples. Our deep learning model achieved a Precision of 0.940, a Recall of 0.918 and a mAP@0.5 (Pascal VOC) of 0.868. The model showed a lower Recall for one species but performed well on all others. Our prototype offers an operational workflow for the creation of the soil fauna picture datasets needed to develop efficient deep learning-based classifiers. The applications are numerous, for example, collection digitization, soil biodiversity analysis and monitoring, or automatic assessment of mesofauna-based bioindicators.
Computer vision is a rapidly emerging tool for handling the complexity of biodiversity efficiently and is already used successfully to recognize plants and vertebrates in pictures. It is on its way to becoming a major asset for dealing with invertebrate diversity. The code and setup instructions are available on GitHub.
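The reported Precision and Recall are detection metrics that depend on how predicted boxes are matched to annotated ones. The sketch below shows one common convention, assumed here and not taken from the CollembolAI code: predictions are matched greedily in order of confidence, and a match counts when the class agrees and the intersection over union (IoU) is at least 0.5. The boxes, labels and scores are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """predictions: list of (box, label, score); ground_truth: list of (box, label)."""
    matched = set()  # indices of ground-truth boxes already claimed
    true_positives = 0
    # Highest-confidence predictions get first pick of ground-truth boxes.
    for box, label, _score in sorted(predictions, key=lambda p: p[2], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_box, gt_label) in enumerate(ground_truth):
            if idx in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical annotations and detections for one image.
ground_truth = [((0, 0, 10, 10), "springtail"), ((20, 20, 30, 30), "mite")]
predictions = [
    ((1, 1, 10, 10), "springtail", 0.9),   # overlaps the springtail box well
    ((50, 50, 60, 60), "mite", 0.8),       # no overlap with any annotation
    ((0, 0, 5, 5), "springtail", 0.3),     # duplicate of an already matched box
]

precision, recall = precision_recall(predictions, ground_truth)
print(round(precision, 3), round(recall, 3))
```

Mean average precision (mAP@0.5) builds on the same matching step by sweeping the confidence threshold and averaging the area under the precision-recall curve over classes.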
Mislabeling of cases as well as controls in case–control studies is a frequent source of strong bias in prognostic and diagnostic tests and algorithms. Common data processing methods available to researchers in the biomedical community do not allow for consistent and robust treatment of labeled data in situations where both the case and the control groups contain a non-negligible proportion of mislabeled data instances. This is an especially prominent issue in studies of late-onset conditions, where individuals who may later convert to cases may populate the control group, and in screening studies, which often have high false-positive and false-negative rates. To address this problem, we propose a method for simultaneous robust inference of Lasso-reduced discriminative models and of latent group-specific mislabeling risks that does not require any exactly labeled data. We apply it to a standard breast cancer imaging dataset and infer the mislabeling probabilities (the rates of false-negative and false-positive core-needle biopsies) together with a small set of simple diagnostic rules, outperforming the state-of-the-art BI-RADS diagnostics on these data. The inferred mislabeling rates for breast cancer biopsies agree with published, purely empirical studies. Applying the method to human genomic data from a healthy-ageing cohort reveals a previously unreported compact combination of single-nucleotide polymorphisms that is strongly associated with a healthy-ageing phenotype for Caucasians. It determines that 7.5% of Caucasians in the 1000 Genomes dataset (selected as a control group) carry a pattern characteristic of healthy ageing.