The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize it to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions, and using regularization to get more stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian bootstrap (BB-pseudo-BMA). Based on simulations and real-data applications, we recommend stacking of predictive distributions, with BB-pseudo-BMA as an approximate alternative when computation cost is an issue.

• M-open refers to the situation in which we know the true model M_t is not in the model list M, but we cannot specify the explicit form p(ỹ | y) = p(ỹ | M_t, y) because it is too difficult to do so, we lack the time or expertise, it is computationally intractable, and so on.

BMA is appropriate for the M-closed case. In the M-open and M-complete cases, BMA will asymptotically select the single model in the list that is closest to the true model in Kullback–Leibler (KL) divergence. Furthermore, in BMA the marginal likelihood depends sensitively on the specified prior p(θ_k | M_k) for each model. For example, consider a problem where a parameter has an essentially flat prior over the region where the likelihood is concentrated: widening that prior further leaves the posterior essentially unchanged but can shrink the marginal likelihood, and hence the model's BMA weight, by an arbitrarily large factor.
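As an illustration of the stacking objective described above, here is a minimal sketch (our own, not the paper's reference code) of finding the simplex weights that maximize the leave-one-out log score of the mixture. It assumes a matrix lpd of log LOO predictive densities log p(y_i | y_{−i}, M_k), e.g. from PSIS-LOO; the function name and setup are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def stacking_weights(lpd):
    """Stacking of predictive distributions.

    lpd: (n, K) array with lpd[i, k] = log p(y_i | y_{-i}, M_k),
    e.g. computed with PSIS-LOO. Returns the simplex weights that
    maximize the leave-one-out log score of the mixture.
    """
    n, K = lpd.shape

    def neg_log_score(z):
        # softmax parametrization keeps the weights on the simplex
        log_w = z - logsumexp(z)
        # log mixture density at each held-out point:
        # log sum_k w_k p(y_i | y_{-i}, M_k)
        return -np.sum(logsumexp(lpd + log_w, axis=1))

    res = minimize(neg_log_score, np.zeros(K), method="BFGS")
    return np.exp(res.x - logsumexp(res.x))

# toy usage: weights for three models scored on 100 held-out points
lpd = np.random.default_rng(0).normal(size=(100, 3))
print(stacking_weights(lpd))
```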
We applied three Bayesian methods to reanalyse the preregistered contributions to the Social Psychology special issue ‘Replications of Important Results in Social Psychology’ (Nosek & Lakens 2014 Registered reports: a method to increase the credibility of published results. Soc. Psychol. 45, 137–141. (doi:10.1027/1864-9335/a000192)). First, individual-experiment Bayesian parameter estimation revealed that for directed effect size measures, only three out of 44 central 95% credible intervals did not overlap with zero and fell in the expected direction. For undirected effect size measures, only four out of 59 credible intervals contained values greater than 0.10 (10% of variance explained) and only 19 intervals contained values larger than 0.05. Second, a Bayesian random-effects meta-analysis for all 38 t-tests showed that only one out of the 38 hierarchically estimated credible intervals did not overlap with zero and fell in the expected direction. Third, a Bayes factor hypothesis test was used to quantify the evidence for the null hypothesis against a default one-sided alternative. Only seven out of 60 Bayes factors indicated non-anecdotal support in favour of the alternative hypothesis (BF10 > 3), whereas 51 Bayes factors indicated at least some support for the null hypothesis. We hope that future analyses of replication success will embrace a more inclusive statistical approach by adopting a wider range of complementary techniques.
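For readers who want to see what a "default Bayes factor" computation looks like, the sketch below implements the standard two-sided JZS Bayes factor for a one-sample t test (Rouder et al. 2009) by numerical integration. This is a generic illustration under our own assumptions, not the one-sided analysis code used in the reanalysis above.

```python
import numpy as np
from scipy.integrate import quad

def jzs_bf10(t, n):
    """Two-sided default (JZS) Bayes factor for a one-sample t test,
    following Rouder et al. (2009); BF10 > 1 favours the alternative.

    t: observed t statistic; n: sample size.
    """
    nu = n - 1
    # marginal likelihood under H0 (common constants cancel in the ratio)
    m0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    # under H1, integrate over g with the inverse-gamma(1/2, 1/2) prior
    def integrand(g):
        return ((1 + n * g) ** -0.5
                * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))

    m1, _ = quad(integrand, 0, np.inf)
    return m1 / m0

# e.g. a small observed effect in a modest sample gives BF10 below 1
print(jzs_bf10(t=1.0, n=30))
```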
1. What was not but could be if

The most important aspect of communicating a statistical method to a new audience is to carefully and accurately sketch out the types of problems where it is applicable. As people who think leave-one-out cross-validation (LOO-CV, or LOO for short) is a good method for model comparison and model criticism, we were pleased to discover that Gronau and Wagenmakers (2018, henceforth GW) chose to write a paper aimed at explaining the nuances of LOO methods to a psychology audience. Unfortunately, we do not think the criticisms and discussions provided in their paper are particularly relevant to LOO as we understand it. The variant of LOO that GW discuss is at odds with a long literature on how to use LOO well; they focus on pathologizing a known and essentially unimportant property of the method; and they fail to discuss the most common issues that arise when using LOO in a real statistical analysis. In this discussion we address a number of concerns that everyone needs to think about before using LOO, reinterpret GW's examples, and try to explain the benefits of allowing for epistemological uncertainty when performing model selection.

2. We need to abandon the idea that there is a device that will produce a single-number decision rule

The most pernicious idea in statistics is the idea that we can produce a single-number summary of any data set and this will be enough to make a decision. This view is perpetuated by GW's paper, which says that the only way that LOO can provide evidence for choosing a single model is for the pseudo-Bayes factor to grow without bound (or, equivalently, for the model weight to approach 1) as the sample size increases. This is not a good way to use LOO and fundamentally misjudges both its potential and its limitations as a tool for model selection and model criticism. For a Bayesian model with n data points y_i ∼ p(y | θ) and parameters θ ∼ p(θ), LOO provides an estimate of the expected log posterior predictive distribution,

elpd = Σ_{i=1}^n E_ỹ[log p(ỹ_i | y_all)] ≈ Σ_{i=1}^n log p(y_i | y_{−i}),

where the expectation is taken with respect to new data ỹ, y_all is all n observed data points, and y_{−i} is all data points except the ith one. There are two things to note here. Firstly, the computed LOO score is an empirical approximation to the expectation that we actually want to compute. This means that we must never consider it without also considering its uncertainty.
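To make the LOO estimate above concrete, here is a hedged sketch (our own toy example, not one of GW's) that computes Σ_i log p(y_i | y_{−i}) exactly for a conjugate normal model, where each leave-one-out posterior is available in closed form. All names and default values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def elpd_loo_normal(y, sigma=1.0, mu0=0.0, tau0=10.0):
    """Exact LOO log score for the conjugate model
    y_i ~ N(theta, sigma^2), theta ~ N(mu0, tau0^2).

    Returns sum_i log p(y_i | y_{-i}), the LOO estimate of elpd.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    lpd = np.empty(n)
    for i in range(n):
        y_rest = np.delete(y, i)                     # y_{-i}
        prec = 1 / tau0**2 + (n - 1) / sigma**2      # posterior precision
        tau_n2 = 1 / prec                            # posterior variance
        mu_n = tau_n2 * (mu0 / tau0**2 + y_rest.sum() / sigma**2)
        # posterior predictive for the held-out point: N(mu_n, tau_n2 + sigma^2)
        lpd[i] = norm.logpdf(y[i], loc=mu_n, scale=np.sqrt(tau_n2 + sigma**2))
    return lpd.sum()

y = np.random.default_rng(1).normal(0.3, 1.0, size=50)
print(elpd_loo_normal(y))
```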
Stacking is a widely used model averaging technique that asymptotically yields optimal predictions among linear averages. We show that stacking is most effective when model predictive performance is heterogeneous in inputs, and that the stacked mixture can be further improved with a hierarchical model. We generalize stacking to Bayesian hierarchical stacking: the model weights vary as functions of the input data, are partially pooled, and are inferred using Bayesian inference. We further incorporate discrete and continuous inputs, other structured priors, and time series and longitudinal data. To verify the performance gain of the proposed method, we derive theoretical bounds and demonstrate the approach on several applied problems.
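A minimal sketch of the idea above: make the stacking weights a softmax of linear functions of the input. The full method places hierarchical priors on these coefficients and infers them with full Bayesian inference; here, as a hedged stand-in for partial pooling, an L2 penalty shrinks the input-dependent coefficients toward zero (i.e. toward constant weights). All names are illustrative, and lpd is again a matrix of log LOO predictive densities as in the earlier stacking sketch.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def hierarchical_stacking_map(X, lpd, lam=1.0):
    """MAP-style sketch of input-dependent stacking weights
    w_k(x) = softmax(alpha_k + x @ beta[:, k]).

    X: (n, d) inputs; lpd: (n, K) log LOO predictive densities;
    lam: strength of the L2 penalty that shrinks beta toward zero
    (constant weights), standing in for the partial-pooling prior.
    """
    n, d = X.shape
    K = lpd.shape[1]

    def unpack(z):
        return z[:K], z[K:].reshape(d, K)

    def neg_objective(z):
        alpha, beta = unpack(z)
        logits = alpha + X @ beta                                # (n, K)
        log_w = logits - logsumexp(logits, axis=1, keepdims=True)
        # penalized LOO log score of the pointwise mixture
        return -np.sum(logsumexp(log_w + lpd, axis=1)) + lam * np.sum(beta**2)

    res = minimize(neg_objective, np.zeros(K + d * K), method="BFGS")
    return unpack(res.x)
```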
When working with multimodal Bayesian posterior distributions, Markov chain Monte Carlo (MCMC) algorithms can have difficulty moving between modes, and default variational or mode-based approximate inferences will understate posterior uncertainty. And, even if the most important modes can be found, it is difficult to evaluate their relative weights in the posterior. Here we propose an alternative approach, using parallel runs of MCMC, variational, or mode-based inference to hit as many modes or separated regions as possible, and then combining these using importance-sampling-based Bayesian stacking, a scalable method for constructing a weighted average of distributions so as to maximize cross-validated prediction utility. The result from stacking is not necessarily equivalent, even asymptotically, to fully Bayesian inference, but it serves many of the same goals. Under misspecified models, stacking can give better predictive performance than full Bayesian inference, hence the multimodality can be considered a blessing rather than a curse. We explore examples in which the stacked inference approximates the true data-generating process despite model misspecification, as well as examples of inconsistent inference and non-mixing samplers. We elaborate the practical implementation in the context of latent Dirichlet allocation, Gaussian process regression, hierarchical models, variational inference in horseshoe regression, and neural networks.
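A toy sketch of the combination step, reusing stacking_weights from the sketch after the first abstract above: given S parallel runs (chains, modes, or restarts) that each produce posterior draws and pointwise log LOO predictive densities, weight the runs by stacking and resample draws from the weighted mixture. This is illustrative only; the names and the scalar-parameter assumption are ours.

```python
import numpy as np

# Assumes stacking_weights() from the earlier sketch is in scope.
def stack_parallel_runs(draws_per_run, lpd_runs, n_out, seed=None):
    """Combine S parallel inference runs by stacking.

    draws_per_run: list of S 1-D arrays of posterior draws of a scalar
    quantity, one array per run; lpd_runs: (n, S) array with
    lpd_runs[i, s] = log p(y_i | y_{-i}, run s), e.g. from per-run PSIS-LOO.
    Returns n_out draws from the stacked mixture of the runs.
    """
    rng = np.random.default_rng(seed)
    w = stacking_weights(lpd_runs)            # simplex weights over the runs
    runs = rng.choice(len(draws_per_run), size=n_out, p=w)
    return np.array([rng.choice(draws_per_run[s]) for s in runs])
```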