Summary: Most common human diseases are likely to have complex etiologies. Methods of analysis that allow for the phenomenon of epistasis are of growing interest in the genetic dissection of complex diseases. By allowing for epistatic interactions between potential disease loci, we may succeed in identifying genetic variants that might otherwise have remained undetected. Here we aimed to analyze the ability of logistic regression (LR) and two tree-based supervised learning methods, classification and regression trees (CART) and random forest (RF), to detect epistasis. Multifactor-dimensionality reduction (MDR) was also used for comparison. Our approach first involves the simulation of datasets of autosomal, biallelic, unphased and unlinked single nucleotide polymorphisms (SNPs), each containing a two-locus interaction (causal SNPs) and 98 'noise' SNPs. We modelled interactions under different scenarios of sample size, missing data, minor allele frequency (MAF) and several penetrance models: three involving both (indistinguishable) marginal effects and interaction, and two simulating pure interaction effects. In total, we simulated 99 different scenarios. Although CART, RF, and LR yield similar results in terms of detection of true association, CART and RF perform better than LR with respect to classification error. MAF, penetrance model, and sample size are greater determining factors than percentage of missing data in the ability of the different techniques to detect true association. In pure interaction models, only RF detects association. In conclusion, tree-based methods and LR are important statistical tools for the detection of unknown interactions among true risk-associated SNPs with marginal effects and in the presence of a significant number of noise SNPs. In pure interaction models, RF performs reasonably well in the presence of large sample sizes and low percentages of missing data. However, when the study design is suboptimal (unfavourable for detecting interaction in terms of, e.g., sample size and MAF), there is a high chance of detecting false, spurious associations.
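The simulation design described above can be sketched roughly as follows. This is a minimal illustration, not the authors' simulation code: the penetrance table, the noise-SNP MAF range (0.05 to 0.5), the Hardy-Weinberg sampling, and all function names are assumptions chosen for the example. The table is a textbook-style pure interaction model in which, at MAF 0.5, each causal SNP has no marginal effect on its own.

```python
import random

# Hypothetical penetrance table, indexed [genotype at SNP1][genotype at SNP2].
# Chosen for illustration only; it is not one of the study's models.
PURE_INTERACTION = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]


def draw_genotype(maf, rng):
    """Unphased biallelic genotype: number of minor alleles (0, 1 or 2)."""
    return int(rng.random() < maf) + int(rng.random() < maf)


def marginal_penetrance(table, maf):
    """Risk per SNP1 genotype, averaged over SNP2 under Hardy-Weinberg."""
    q = 1.0 - maf
    hwe = [q * q, 2 * maf * q, maf * maf]
    return [sum(p * w for p, w in zip(row, hwe)) for row in table]


def simulate(n_cases, n_controls, table, maf=0.5, n_noise=98,
             missing=0.0, seed=1):
    """Return (rows, labels): 2 causal SNPs followed by n_noise noise SNPs."""
    rng = random.Random(seed)
    # Assumed range for noise-SNP allele frequencies.
    noise_mafs = [rng.uniform(0.05, 0.5) for _ in range(n_noise)]
    need = {1: n_cases, 0: n_controls}
    rows, labels = [], []
    # Draw individuals from the population until both groups are full.
    while need[0] or need[1]:
        g1, g2 = draw_genotype(maf, rng), draw_genotype(maf, rng)
        status = int(rng.random() < table[g1][g2])
        if need[status] == 0:
            continue
        need[status] -= 1
        snps = [g1, g2] + [draw_genotype(m, rng) for m in noise_mafs]
        # Missing genotypes are recorded as None.
        rows.append([None if rng.random() < missing else g for g in snps])
        labels.append(status)
    return rows, labels


rows, labels = simulate(50, 50, PURE_INTERACTION, missing=0.05)
print(len(rows), sum(labels), marginal_penetrance(PURE_INTERACTION, 0.5))
```

Because the marginal penetrances are identical across genotypes in this table, a single-SNP test has nothing to detect, which is the situation the summary refers to as a pure interaction effect.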
Background: A large body of genetic research has focused on the potential role that mitochondrial DNA (mtDNA) variants might play in predisposition to common and complex (multi-factorial) diseases. It has been argued, however, that many of these studies could be inconclusive due to artifacts related to genotyping errors or inadequate design.
Methods: Analyses of the data published in case–control breast cancer association studies have been performed using a phylogenetic-based approach. Variation observed in these studies has been interpreted in the light of data available on public resources, which now include more than 27,000 complete mitochondrial sequences and the worldwide phylogeny determined by these mitogenomes. Complementary analyses were carried out using public datasets of partial mtDNA sequences, mainly corresponding to control-region segments.
Results: By way of example, we show here another kind of fallacy in these medical studies, namely, the phenomenon of SNP-SNP interaction wrongly applied to haploid data in a breast cancer study. We also reassessed the mutually conflicting studies suggesting some functional role of the non-synonymous polymorphism m.10398A>G (ND3 subunit of mitochondrial complex I) in breast cancer. In some studies, control groups were employed that showed an extremely odd haplogroup frequency spectrum compared to comparable information from much larger databases. Moreover, the use of inappropriate statistics signaled spurious “significance” in several instances.
Conclusions: Every case–control study should come under scrutiny in regard to the plausibility of the control-group data presented and the appropriateness of the statistical methods employed; and this is best done before potential publication.
There is growing interest in developing additional DNA typing techniques to provide better investigative leads in forensic analysis. These include inference of genetic ancestry and prediction of common physical characteristics of DNA donors. To date, forensic ancestry analysis has centered on population-divergent SNPs, but these binary loci cannot reliably detect DNA mixtures, which are common in forensic samples. Furthermore, STR genotypes, forming the principal DNA profiling system, are not routinely combined with forensic SNPs to strengthen frequency data available for ancestry inference. We report development of a 12-STR multiplex composed of ancestry informative marker STRs (AIM-STRs) selected from 434 tetranucleotide repeat loci. We adapted Snipper, our online Bayesian classifier for AIM-SNPs, to handle multiallele STR data using frequency-based training sets. We assessed the ability of the 12-plex AIM-STRs to differentiate CEPH Human Genome Diversity Panel populations, plus their informativeness combined with established forensic STRs and AIM-SNPs. We found that combining STRs and SNPs improves the success rate of ancestry assignments while providing a reliable mixture detection system lacking from SNP analysis alone. As the 12 STRs generally show a broad range of alleles in all populations, they provide highly informative supplementary STRs for extended relationship testing and identification of missing persons with incomplete reference pedigrees. Lastly, mixed marker approaches (combining STRs with binary loci) for simple ancestry inference tests beyond forensic analysis bring advantages, and we discuss the genotyping options available.
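As a rough illustration of frequency-based Bayesian classification with multiallele STR data, the sketch below computes posterior population probabilities from allele-frequency training sets. It is not Snipper's implementation: the locus names, population labels, allele frequencies, the Hardy-Weinberg genotype likelihood, the equal priors, and the frequency floor for unseen alleles are all assumptions made for the example. The small `looks_mixed` helper reflects the general point that more than two alleles at a locus suggest a mixture, a signal a binary SNP cannot give.

```python
import math

# Hypothetical training sets: population -> locus -> allele -> frequency.
TRAINING = {
    "popX": {"STR_A": {12: 0.6, 13: 0.3, 14: 0.1}, "STR_B": {8: 0.7, 9: 0.3}},
    "popY": {"STR_A": {12: 0.1, 13: 0.3, 14: 0.6}, "STR_B": {8: 0.2, 9: 0.8}},
}


def log_likelihood(genotype, freqs, floor=1e-3):
    """Log-probability of a multi-locus genotype under Hardy-Weinberg.

    Alleles absent from the training set get an assumed floor frequency.
    """
    total = 0.0
    for locus, (a, b) in genotype.items():
        pa = max(freqs[locus].get(a, 0.0), floor)
        pb = max(freqs[locus].get(b, 0.0), floor)
        total += math.log(pa * pa if a == b else 2.0 * pa * pb)
    return total


def posterior(genotype, training):
    """Posterior probability per population, assuming equal priors."""
    logs = {pop: log_likelihood(genotype, f) for pop, f in training.items()}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {pop: math.exp(v - top) for pop, v in logs.items()}
    norm = sum(weights.values())
    return {pop: w / norm for pop, w in weights.items()}


def looks_mixed(observed_alleles):
    """Flag a profile as a possible mixture if any locus shows >2 alleles."""
    return any(len(set(alleles)) > 2 for alleles in observed_alleles.values())


profile = {"STR_A": (12, 12), "STR_B": (8, 9)}
print(posterior(profile, TRAINING))
print(looks_mixed({"STR_A": [12, 13, 14], "STR_B": [8, 9]}))
```

Adding further loci, whether STRs or SNPs, only adds terms to the log-likelihood sum, which is one way to picture how combining marker types can sharpen an ancestry assignment.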