Accurate genetic data are an important prerequisite for genetic linkage and association tests. Currently, most analytical methods assume that the observed genotypes are correct. However, because of technical limitations, most of the genetic data used so far contain errors. In this paper, we considered the problem of QTL mapping based on biological data with genotyping errors. By analysing all possible genotypes of each individual within the framework of multiple interval mapping, we proposed an algorithm for inferring all model parameters through the expectation-maximization (EM) algorithm and discussed hypothesis testing for the existence of a QTL. We carried out extensive simulation studies to assess the proposed method. The simulation results showed that the new method outperforms the method that does not take genotyping errors into account, and that it can therefore decrease the impact of genotyping errors on QTL mapping. The proposed method was also applied to a real barley dataset.
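As a rough illustration of why modelling genotyping errors matters, the sketch below computes the posterior probability of each true genotype given a single observed call. It is not the authors' algorithm: the function name `true_genotype_posterior`, the uniform miscall model, and the example prior and error rate are all assumptions made for illustration. A quantity of this kind is what an E-step could average over when all possible genotypes of an individual are considered.

```python
def true_genotype_posterior(observed, prior, error_rate):
    """Posterior over true genotypes given one observed genotype call.

    Assumption (not from the paper): a call is correct with probability
    1 - error_rate, and a miscall is spread uniformly over the other genotypes.
    """
    k = len(prior)
    if k < 2:
        raise ValueError("need at least two candidate genotypes")
    unnormalised = {}
    for genotype, p in prior.items():
        if genotype == observed:
            likelihood = 1.0 - error_rate
        else:
            likelihood = error_rate / (k - 1)
        unnormalised[genotype] = likelihood * p
    total = sum(unnormalised.values())
    # Bayes' rule: normalise likelihood * prior over all candidate genotypes
    return {g: v / total for g, v in unnormalised.items()}


# Hypothetical backcross example: two equally likely genotypes, 10% error rate
posterior = true_genotype_posterior("AA", {"AA": 0.5, "Aa": 0.5}, 0.1)
```

With a zero error rate the posterior collapses onto the observed genotype, which corresponds to the usual assumption that observed genotypes are correct.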
Classification of imbalanced data is a challenging task that has attracted considerable interest in numerous scientific fields because of the great practical value of accurately classifying the minority class. Several methods for improving generalization performance have been developed to address this situation. Here, we propose a cost-sensitive ensemble learning method that uses a support vector machine as the base learner of AdaBoost for classifying imbalanced data. Because existing methods have not been well studied in terms of how to precisely control the classification accuracy of the minority class, we developed a novel way to rebalance the weights of AdaBoost, which in turn influence the training of the base learner. This weighting strategy increases the sample weights of misclassified minority samples while decreasing the sample weights of misclassified majority samples until the two distributions are even in each round. Furthermore, we included P-mean as one of the assessment metrics and discussed why it is necessary. Experiments were conducted to compare the proposed model with 10 comparison models on 18 datasets in terms of six different metrics. A statistical analysis of the experimental results was performed to verify the efficacy and usability of the proposed model.
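The abstract does not give the exact update rule, so the following is only a minimal sketch of one possible reading of the rebalancing idea. It applies the standard AdaBoost up-weighting of misclassified samples and then rescales the weights so that the minority and majority classes each carry half of the total weight. The function name `boost_and_rebalance`, the half-and-half target, and the example inputs are assumptions, not the authors' method.

```python
import math


def boost_and_rebalance(weights, labels, predictions, alpha, minority_label):
    """One boosting round followed by a class-level rebalance (illustrative).

    Step 1 is the usual AdaBoost update: misclassified samples are multiplied
    by exp(alpha). Step 2 is an assumed rebalance: each class is rescaled so
    that it holds half of the total weight.
    """
    updated = [
        w * math.exp(alpha) if y != p else w
        for w, y, p in zip(weights, labels, predictions)
    ]
    minority_total = sum(w for w, y in zip(updated, labels) if y == minority_label)
    majority_total = sum(w for w, y in zip(updated, labels) if y != minority_label)
    if minority_total == 0 or majority_total == 0:
        raise ValueError("both classes must carry positive weight")
    return [
        0.5 * w / minority_total if y == minority_label else 0.5 * w / majority_total
        for w, y in zip(updated, labels)
    ]


# Hypothetical round: one minority sample (label 1) and three majority samples
new_weights = boost_and_rebalance(
    weights=[0.25, 0.25, 0.25, 0.25],
    labels=[1, 0, 0, 0],
    predictions=[0, 0, 0, 1],  # first and last samples are misclassified
    alpha=math.log(2),
    minority_label=1,
)
```

In this toy round the misclassified majority sample still weighs more than the correctly classified majority samples, but the majority class as a whole is scaled down so that it no longer dominates the minority class.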
With a large number of quantitative trait loci being identified in genome-wide association studies, researchers have become more interested in detecting interactions among genes or single nucleotide polymorphisms (SNPs). In this research, we carried out a two-stage model selection procedure to detect interacting gene pairs or SNP pairs associated with four important traits of outbred mice: glucose, high-density lipoprotein cholesterol, diastolic blood pressure and triglyceride. In the first stage, a variance heterogeneity test was used to screen for candidate SNPs. In the second stage, the Lasso method and single-pair analysis were used to select two-way interactions. Moreover, shared Gene Ontology information about the selected interacting gene pairs was used as auxiliary evidence in studying the interactions. Using this method, we not only replicated the identification of important SNPs associated with each trait of outbred mice, but also found some SNP pairs and gene pairs with significant interaction effects on each trait. Simulation studies were also conducted to evaluate the performance of the two-stage method in different situations.
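The abstract does not name the variance heterogeneity test used in the first stage. As a hedged sketch of what such a screen could look like, the code below computes the Brown-Forsythe statistic (Levene's test centred on group medians) for trait values grouped by SNP genotype. The choice of this particular test, the function name `brown_forsythe_statistic`, and the example data are assumptions.

```python
from statistics import median


def brown_forsythe_statistic(groups):
    """Brown-Forsythe statistic for equality of variances across groups.

    Each element of `groups` holds the trait values of one genotype group.
    The statistic is a one-way ANOVA F ratio computed on absolute deviations
    from each group's median.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more samples than groups")
    medians = [median(g) for g in groups]
    # Absolute deviation of every observation from its own group median
    z = [[abs(y - m) for y in g] for g, m in zip(groups, medians)]
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    if within == 0:
        raise ValueError("no within-group variation in the deviations")
    return (n_total - k) / (k - 1) * between / within


# Hypothetical trait values for two genotype groups with different spread
w_stat = brown_forsythe_statistic([[1, 2, 3], [0, 2, 4]])
```

Under the null hypothesis of equal variances, the statistic is approximately F-distributed with k - 1 and N - k degrees of freedom, so a large value at a SNP marks it as a candidate for the second stage.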