DNA methylation studies have enabled researchers to understand methylation patterns and their regulatory roles in biological processes and disease. However, only a limited number of statistical approaches have been developed to provide formal quantitative analysis. Specifically, a few available methods do identify differentially methylated CpG (DMC) sites or regions (DMR), but they suffer from limitations that arise mostly from challenges inherent in bisulfite sequencing data. These challenges include: (1) read depths vary considerably among genomic positions and are often low; (2) both methylation and autocorrelation patterns change from region to region; and (3) CpG sites are distributed unevenly. Furthermore, there are several methodological limitations: almost none of these tools can compare multiple groups and/or handle missing values, and only a few allow continuous or multiple covariates. The last of these is of great interest to researchers, as the goal is often to find which regions of the genome are associated with several exposures and traits. To tackle these issues, we have developed an efficient DMC identification method based on Hidden Markov Models (HMMs), called "DMCHMM", which is a three-step approach (model selection, prediction, testing) that aims to address the aforementioned drawbacks. Our proposed method differs from other HMM methods in that it profiles the methylation of each sample separately, hence exploiting inter-CpG autocorrelation within samples, and it is more flexible than previous approaches because it allows multiple hidden states. Using simulations, we show that DMCHMM has the best performance among several competing methods. An analysis of cell-separated blood methylation profiles is also provided.
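To illustrate what per-sample HMM profiling can look like, the following is a minimal sketch, not the DMCHMM implementation. It smooths one sample's methylation counts with a K-state HMM that has binomial emissions, using the forward-backward algorithm. The function name `smooth_methylation`, the state levels, and the `stay` transition probability are assumptions made for this example. All parameters are held fixed here, whereas the abstract describes a model selection step that is not shown.

```python
import numpy as np

def smooth_methylation(meth, depth, levels, stay=0.9):
    """Posterior-mean methylation level per CpG under a fixed K-state HMM.

    meth, depth : methylated read counts and read depths per CpG (depth may be 0)
    levels      : assumed methylation probability of each hidden state, in (0, 1)
    stay        : assumed probability of remaining in the same state (K >= 2)
    """
    meth = np.asarray(meth, dtype=float)
    depth = np.asarray(depth, dtype=float)
    levels = np.asarray(levels, dtype=float)
    n, k = len(meth), len(levels)

    # Transition matrix: stay with probability `stay`, otherwise move uniformly.
    trans = np.full((k, k), (1.0 - stay) / (k - 1))
    np.fill_diagonal(trans, stay)

    # Binomial log-likelihood per site and state. The binomial coefficient is
    # omitted because it is constant across states. A site with depth 0
    # (a missing value) gives a flat row, so its neighbours drive the estimate.
    loglik = (meth[:, None] * np.log(levels)[None, :]
              + (depth - meth)[:, None] * np.log1p(-levels)[None, :])
    emis = np.exp(loglik - loglik.max(axis=1, keepdims=True))

    # Scaled forward pass with a uniform initial distribution.
    alpha = np.zeros((n, k))
    alpha[0] = emis[0] / k
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = emis[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()

    # Scaled backward pass.
    beta = np.ones((n, k))
    for t in range(n - 2, -1, -1):
        beta[t] = trans @ (emis[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    # Posterior state probabilities, then the posterior-mean methylation level.
    post = alpha * beta
    post /= post.sum(axis=1, keepdims=True)
    return post @ levels

# Toy sample: uneven read depths, with two sites that have no reads at all.
meth = [9, 8, 0, 10, 1, 0, 0, 1]
depth = [10, 10, 0, 10, 10, 0, 8, 10]
smoothed = smooth_methylation(meth, depth, levels=[0.1, 0.5, 0.9])
```

Because neighbouring sites share information through the transition matrix, the two zero-depth sites receive estimates that follow their neighbours rather than being dropped, which is the kind of within-sample autocorrelation the abstract refers to.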
Mixed models are commonly used in small area estimation. In particular, small area estimation has been extensively studied under linear mixed models. Recently, small area estimation under a linear mixed model with a penalized spline (P‐spline) regression model for the fixed part of the model has been proposed. However, in practice there are many situations in which the responses in small areas are counts or proportions; an example is a dataset on the number of asthma physician visits in small areas in Manitoba. Covariates such as age and genetic and environmental factors seem to predict asthma physician visits; however, these relationships may not be linear (see Section 5). In this paper, small area estimation under generalized linear mixed models using P‐spline regression models is proposed to cover Normal and non‐Normal responses. In particular, the empirical best predictors of small area parameters and their corresponding prediction intervals are studied. The performance of the proposed approach is evaluated through simulation studies and a real dataset. The Canadian Journal of Statistics 43: 82–96; 2015 © 2015 Statistical Society of Canada
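The P‐spline idea for a count response can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: it uses a truncated linear basis, a fixed smoothing parameter `lam`, and a Poisson log-linear model fitted by penalized iteratively reweighted least squares. It has no area-level random effects, so it is not the generalized linear mixed model or the empirical best predictor described above. The function names and the simulated data are hypothetical.

```python
import numpy as np

def pspline_basis(x, knots):
    """Design matrix [1, x, (x - k_1)_+, ..., (x - k_K)_+]."""
    x = np.asarray(x, dtype=float)
    trunc = np.maximum(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0)
    return np.column_stack([np.ones_like(x), x, trunc])

def fit_poisson_pspline(x, y, knots, lam=1.0, n_iter=50, tol=1e-8):
    """Penalized IRLS for log E[y] = B(x) @ coef.

    Only the truncated-line (knot) coefficients are penalized, so the
    intercept and the linear term are left unshrunk.
    """
    B = pspline_basis(x, knots)
    y = np.asarray(y, dtype=float)
    penalty = lam * np.diag([0.0, 0.0] + [1.0] * len(knots))

    # Start from a constant fit at the sample mean.
    coef = np.zeros(B.shape[1])
    coef[0] = np.log(y.mean() + 1e-8)

    for _ in range(n_iter):
        eta = B @ coef
        mu = np.exp(eta)
        z = eta + (y - mu) / mu          # working response
        lhs = B.T @ (mu[:, None] * B) + penalty
        rhs = B.T @ (mu * z)             # working weights equal mu
        new_coef = np.linalg.solve(lhs, rhs)
        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if converged:
            break
    return coef

# Simulated counts with a non-linear covariate effect (hypothetical data).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = rng.poisson(np.exp(1.0 + np.sin(2.0 * np.pi * x)))
knots = np.linspace(0.1, 0.9, 10)
coef = fit_poisson_pspline(x, y, knots, lam=1.0)
fitted = np.exp(pspline_basis(x, knots) @ coef)
```

In practice the smoothing parameter would be estimated rather than fixed, for example by treating the knot coefficients as random effects, which is how penalized splines are usually embedded in a mixed model.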
The advent of modern technology has led to a surge of high-dimensional data in biology and health sciences such as genomics, epigenomics and medicine. The high-grade serous ovarian cancer (HGS-OvCa) data reported by The Cancer Genome Atlas (TCGA) Research Network is one example. The TCGA and other research groups have analyzed several aspects of these data. Here we study the relationship between Disease Free Time (DFT) after surgery among ovarian cancer patients and their DNA methylation profiles of genomic features. Such studies pose additional challenges beyond the typical big data problem due to population substructure and censoring. Despite the availability of several methods for analyzing time-to-event data with a large number of covariates but a small sample size, there is no method available to date that accommodates the additional feature of heterogeneity. To this end, we propose a regularized framework based on a finite mixture of accelerated failure time models to capture intangible heterogeneity due to population substructure and to account for censoring simultaneously. We study the properties of the proposed framework both theoretically and numerically. Our data analysis indicates the existence of heterogeneity in the HGS-OvCa data, with one component of the mixture capturing a more aggressive form of the disease, and the second component capturing a less aggressive form. In particular, the second component portrays a significant positive relationship between methylation and DFT for BRCA1. By further unearthing the negative relationship between expression and methylation for this gene, one may provide a biologically reasonable explanation that sheds light on the relationship between DNA methylation, gene expression and mutation.
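For orientation, a generic finite mixture of accelerated failure time models with right censoring can be written as below. This is the textbook form under assumed notation, not necessarily the exact specification in the paper: the baseline error density f_0 with survival function S_0, the number of components K, and the penalty function p_λ are all left unspecified here.

```latex
% Component k of K: AFT model on the log time scale
\[
\log T_i = \beta_{0k} + \mathbf{x}_i^{\top}\boldsymbol{\beta}_k + \sigma_k \varepsilon_i,
\qquad \Pr(Z_i = k) = \pi_k, \qquad \sum_{k=1}^{K} \pi_k = 1 .
\]

% Mixture density and survival function of the log time
\[
f(y \mid \mathbf{x}_i) = \sum_{k=1}^{K} \frac{\pi_k}{\sigma_k}\,
  f_0\!\left(\frac{y - \beta_{0k} - \mathbf{x}_i^{\top}\boldsymbol{\beta}_k}{\sigma_k}\right),
\qquad
S(y \mid \mathbf{x}_i) = \sum_{k=1}^{K} \pi_k\,
  S_0\!\left(\frac{y - \beta_{0k} - \mathbf{x}_i^{\top}\boldsymbol{\beta}_k}{\sigma_k}\right).
\]

% Penalized log-likelihood with y_i = \log\min(T_i, C_i) and event indicator \delta_i
\[
\ell_{\mathrm{pen}} = \sum_{i=1}^{n} \Big[ \delta_i \log f(y_i \mid \mathbf{x}_i)
  + (1 - \delta_i) \log S(y_i \mid \mathbf{x}_i) \Big]
  - \sum_{k=1}^{K} \sum_{j=1}^{p} p_{\lambda}\big(|\beta_{kj}|\big).
\]
```

Observed events contribute through the mixture density and censored observations through the mixture survival function, while the penalty shrinks the component-specific coefficients, which is what makes the approach usable when covariates far outnumber patients.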
In survey sampling, policymaking regarding the allocation of resources to subgroups (called small areas) or the determination of subgroups with specific properties in a population should be based on reliable estimates. Information, however, is often collected at a different scale than that of these subgroups; hence, the estimation can only be obtained on finer scale data. Parametric mixed models are commonly used in small-area estimation. The relationship between predictors and response, however, may not be linear in some real situations. Recently, small-area estimation using a generalised linear mixed model (GLMM) with a penalised spline (P-spline) regression model for the fixed part of the model has been proposed to analyse cross-sectional responses, both normal and non-normal. However, there are many situations in which the responses in small areas are serially dependent over time. Such a situation is exemplified by a data set on the annual number of visits to physicians by patients seeking treatment for asthma in different areas of Manitoba, Canada. Because covariates that may predict physician visits by asthma patients (e.g. age and genetic and environmental factors) may not have a linear relationship with the response, new models for analysing such data sets are required. In the current work, using both time-series and cross-sectional data methods, we propose P-spline regression models for small-area estimation under GLMMs. Our proposed model covers both normal and non-normal responses. In particular, the empirical best predictors of small-area parameters and their corresponding prediction intervals are studied, with maximum likelihood used to estimate the model parameters. The performance of the proposed approach is evaluated using simulations and by analysing two real data sets (precipitation and asthma).