Many stochastic simulation approaches for generating observations from a posterior distribution depend on knowing a likelihood function. However, for many complex probability models, such likelihoods are either impossible or computationally prohibitive to obtain. Here we present a Markov chain Monte Carlo method for generating observations from a posterior distribution without the use of likelihoods. It can also be used in frequentist applications, in particular for maximum-likelihood estimation. The approach is illustrated by an example of ancestral inference in population genetics. A number of open problems are highlighted in the discussion.

One of the basic problems in Bayesian statistics is the computation of posterior distributions. We imagine data D generated from a model M determined by parameters θ, the prior density of which is denoted by π(θ). We assume unless otherwise stated that the data are discrete. The posterior distribution of interest is f(θ | D), which is given by

f(θ | D) = P(D | θ) π(θ) / P(D),

where P(D) = ∫ P(D | θ) π(θ) dθ is the normalizing constant. In most scientific contexts, explicit formulae for such posterior densities are few and far between, and we usually resort to stochastic simulation to generate observations from f. Perhaps the simplest approach for this is the rejection method:

A1. Generate θ from π(·).
A2. Accept θ with probability h = P(D | θ); return to A1.

There are many variations on this theme. Of particular relevance here is the case in which the likelihood P(D | θ) cannot be computed explicitly. One obvious approach then is:

The success of this approach depends on the fact that the underlying stochastic model M is easy to simulate. This approach can be useful when computation of the likelihood is possible but time-consuming.

The practicality of algorithms such as these depends crucially on the size of P(D), because the probability of accepting an observation is proportional to P(D).
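As a rough illustration of likelihood-free rejection for discrete data, the following minimal Python sketch draws θ from the prior, simulates a data set from the model, and keeps θ only when the simulated data equal the observed data. The binomial model, the uniform prior, and all function names are assumptions chosen for illustration; they do not come from the paper, whose own example concerns ancestral inference. The sketch shows the simple rejection idea only, not the Markov chain Monte Carlo method the paper presents.

```python
import random


def likelihood_free_rejection(observed, sample_prior, simulate, n_accept, rng):
    """Collect n_accept prior draws whose simulated data equal `observed`.

    Hypothetical helper: no likelihood is evaluated, only simulation
    from the model is required.
    """
    accepted = []
    n_trials = 0
    while len(accepted) < n_accept:
        theta = sample_prior(rng)            # generate theta from the prior
        n_trials += 1
        if simulate(theta, rng) == observed:  # accept only on an exact match
            accepted.append(theta)
    return accepted, n_trials


def simulate_binomial(theta, rng, n=10):
    """Toy model (an assumption): number of successes in n Bernoulli(theta) trials."""
    return sum(rng.random() < theta for _ in range(n))


rng = random.Random(0)
samples, trials = likelihood_free_rejection(
    observed=7,                          # observed number of successes out of 10
    sample_prior=lambda r: r.random(),   # Uniform(0, 1) prior on theta
    simulate=simulate_binomial,
    n_accept=2000,
    rng=rng,
)
posterior_mean = sum(samples) / len(samples)
acceptance_rate = len(samples) / trials
print(f"posterior mean ~ {posterior_mean:.3f}, acceptance rate ~ {acceptance_rate:.3f}")
```

For this toy model the exact posterior is Beta(8, 4), with mean 2/3, and the marginal probability of the observed count under the uniform prior is 1/11, so the empirical acceptance rate should be close to that value. This mirrors the remark that acceptance is governed by the size of P(D).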
In cases where the acceptance rate is too small, one might resort to approximate methods such as:

This approach requires selection of a suitable metric ρ as well as a choice of ε. As ε → ∞ it generates observations from the prior. If ε = 0, an observation D′ is accepted only if D′ = D, and then accepted observations come from the density f(θ | D). The choice of ε therefore reflects a tension between computability and accuracy. The method is still honest in that, for a given ρ and ε, we are generating independent and identically distributed observations from f(θ | ρ(D, D′) ≤ ε).

When D is high-dimensional or continuous, this approach can be impractical as well, and then the comparison of D′ with D can be made by using lower-dimensional summaries of the data. The motivation for this approach is that if the set of statistics S = (S_1, . . . , S_p) is sufficient for θ, in that P(D | S, θ) is independent of θ, then f(θ | D) = f(θ | S). The normalizing constant P(S) is typically larger than P(D), resulting in more acceptances. In practice it will be hard, if not impossible, to identify a suitable set of sufficient statistics, and we then might resort to ...
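The tolerance-based variant with summary statistics can be sketched as follows. This is a minimal Python illustration under stated assumptions: the Gaussian model with known unit variance, the Uniform(−5, 5) prior, the sample mean as summary statistic, and every function name are hypothetical choices made here, not taken from the paper. A prior draw θ is kept when ρ(S(D), S(D′)) ≤ ε.

```python
import random
import statistics


def abc_rejection(observed, sample_prior, simulate, summary, distance,
                  eps, n_trials, rng):
    """Keep prior draws whose simulated summary lies within eps of the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_trials):
        theta = sample_prior(rng)               # generate theta from the prior
        s_sim = summary(simulate(theta, rng))   # simulate D' and summarize it
        if distance(s_obs, s_sim) <= eps:       # accept if rho(S, S') <= eps
            accepted.append(theta)
    return accepted


# Toy setting (assumption): 20 observations from a normal with unknown mean, unit variance.
data_rng = random.Random(1)
observed = [data_rng.gauss(2.0, 1.0) for _ in range(20)]

common = dict(
    observed=observed,
    sample_prior=lambda r: r.uniform(-5.0, 5.0),
    simulate=lambda mu, r: [r.gauss(mu, 1.0) for _ in range(20)],
    summary=statistics.fmean,                   # sample mean as the summary statistic
    distance=lambda a, b: abs(a - b),           # metric rho on the summaries
)

rng = random.Random(2)
tight = abc_rejection(eps=0.1, n_trials=20000, rng=rng, **common)
loose = abc_rejection(eps=float("inf"), n_trials=2000, rng=rng, **common)

print(f"eps=0.1: kept {len(tight)} draws, mean {statistics.fmean(tight):.2f}")
print(f"eps=inf: kept {len(loose)} draws, mean {statistics.fmean(loose):.2f}")
```

With an infinite tolerance every draw is accepted, so the output is a sample from the prior. With a small tolerance few draws survive, but they concentrate around the observed sample mean, which illustrates the tension between computability and accuracy described above.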
Motivation: Recently there has been increasing interest in the effects of cell mixture on the measurement of DNA methylation, specifically the extent to which small perturbations in cell mixture proportions can register as changes in DNA methylation. A recently published set of statistical methods exploits this association to infer changes in cell mixture proportions, and these methods are presently being applied to adjust for cell mixture effect in the context of epigenome-wide association studies. However, these adjustments require the existence of reference datasets, which may be laborious or expensive to collect. For some tissues such as placenta, saliva, adipose or tumor tissue, the relevant underlying cell types may not be known.
Results: We propose a method for conducting epigenome-wide association studies analysis when a reference dataset is unavailable, including a bootstrap method for estimating standard errors. We demonstrate via simulation study and several real data analyses that our proposed method can perform as well as or better than methods that make explicit use of reference datasets. In particular, it may adjust for detailed cell type differences that may be unavailable even in existing reference datasets.
Availability and implementation: Software is available in the R package RefFreeEWAS. Data for three of four examples were obtained from Gene Expression Omnibus (GEO), accession numbers GSE37008, GSE42861 and GSE30601, while reference data were obtained from GEO accession number GSE39981.
Contact: andres.houseman@oregonstate.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
There is currently tremendous interest in the possibility of using genome-wide association mapping to identify genes responsible for natural variation, particularly for human disease susceptibility. The model plant Arabidopsis thaliana is in many ways an ideal candidate for such studies, because it is a highly selfing hermaphrodite. As a result, the species largely exists as a collection of naturally occurring inbred lines, or accessions, which can be genotyped once and phenotyped repeatedly. Furthermore, linkage disequilibrium in such a species will be much more extensive than in a comparable outcrossing species. We tested the feasibility of genome-wide association mapping in A. thaliana by searching for associations with flowering time and pathogen resistance in a sample of 95 accessions for which genome-wide polymorphism data were available. In spite of an extremely high rate of false positives due to population structure, we were able to identify known major genes for all phenotypes tested, thus demonstrating the potential of genome-wide association mapping in A. thaliana and other species with similar patterns of variation. The rate of false positives differed strongly between traits, with more clinal traits showing the highest rate. However, the false positive rates were always substantial regardless of the trait, highlighting the necessity of an appropriate genomic control in association studies.
An age-dependent association between variation at the FTO locus and BMI in children has been suggested. We meta-analyzed associations between the FTO locus (rs9939609) and BMI in samples, aged from early infancy to 13 years, from 8 cohorts of European ancestry. We found a positive association between additional minor (A) alleles and BMI from 5.5 years onwards, but an inverse association below age 2.5 years. Modelling median BMI curves for each genotype using the LMS method, we found that carriers of minor alleles showed lower BMI in infancy, earlier adiposity rebound (AR), and higher BMI later in childhood. Differences by allele were consistent with two independent processes: earlier AR equivalent to accelerating developmental age by 2.37% (95% CI 1.87, 2.87, p = 10^−20) per A allele and a positive age by genotype interaction such that BMI increased faster with age (p = 10^−23). We also fitted a linear mixed effects model to relate genotype to the BMI curve inflection points adiposity peak (AP) in infancy and AR. Carriage of two minor alleles at rs9939609 was associated with lower BMI at AP (−0.40% (95% CI: −0.74, −0.06), p = 0.02), higher BMI at AR (0.93% (95% CI: 0.22, 1.64), p = 0.01), and earlier AR (−4.72% (−5.81, −3.63), p = 10^−17), supporting cross-sectional results. Overall, we confirm the expected association between variation at rs9939609 and BMI in childhood, but only after an inverse association between the same variant and BMI in infancy. Patterns are consistent with a shift on the developmental scale, which is reflected in association with the timing of AR rather than just a global increase in BMI. Results provide important information about longitudinal gene effects and about the role of FTO in adiposity. The associated shifts in developmental timing have clinical importance with respect to known relationships between AR and both later-life BMI and metabolic disease risk.
Standard regression analyses are often plagued with problems encountered when one tries to make inference going beyond main effects using data sets that contain dozens of variables that are potentially correlated. This situation arises, for example, in epidemiology where surveys or study questionnaires consisting of a large number of questions yield a potentially unwieldy set of interrelated data from which teasing out the effect of multiple covariates is difficult. We propose a method that addresses these problems for categorical covariates by using, as its basic unit of inference, a profile formed from a sequence of covariate values. These covariate profiles are clustered into groups and associated via a regression model to a relevant outcome. The Bayesian clustering aspect of the proposed modeling framework has a number of advantages over traditional clustering approaches in that it allows the number of groups to vary, uncovers subgroups and examines their association with an outcome of interest, and fits the model as a unit, allowing an individual's outcome potentially to influence cluster membership. The method is demonstrated with an analysis of survey data obtained from the National Survey of Children's Health. The approach has been implemented using the standard Bayesian modeling software, WinBUGS, with code provided in the supplementary material available at Biostatistics online. Further, interpretation of partitions of the data is helped by a number of postprocessing tools that we have developed.