We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.

The lack of reproducibility of scientific studies has caused growing concern over the credibility of claims of new discoveries based on 'statistically significant' findings. There has been much progress toward documenting and addressing several causes of this lack of reproducibility (for example, multiple testing, P-hacking, publication bias and under-powered studies). However, we believe that a leading cause of non-reproducibility has not yet been adequately addressed: statistical standards of evidence for claiming new discoveries in many fields of science are simply too low. Associating statistically significant findings with P < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural and reporting problems.

For fields where the threshold for defining statistical significance for new discoveries is P < 0.05, we propose a change to P < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields. Results that would currently be called significant but do not meet the new threshold should instead be called suggestive. While statisticians have known the relative weakness of using P ≈ 0.05 as a threshold for discovery, and the proposal to lower it to 0.005 is not new (refs. 1, 2), a critical mass of researchers now endorse this change.

We restrict our recommendation to claims of discovery of new effects. We do not address the appropriate threshold for confirmatory or contradictory replications of existing claims. We also do not advocate changes to discovery thresholds in fields that have already adopted more stringent standards (for example, genomics and high-energy physics research; see the 'Potential objections' section below).

We also restrict our recommendation to studies that conduct null hypothesis significance tests. We have diverse views about how best to improve reproducibility, and many of us believe that other ways of summarizing the data, such as Bayes factors or other posterior summaries based on clearly articulated model assumptions, are preferable to P values. However, changing the P value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance.
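To make the false-positive concern concrete, the following minimal sketch computes the expected fraction of "significant" findings that are false positives at the two thresholds. The prior odds (1:10 that a tested effect is real) and 80% power are illustrative assumptions, not figures taken from the text above:

```python
# Sketch: expected false positive rate among "significant" findings,
# under hypothetical prior odds and power (illustrative values only).

def false_positive_rate(alpha, power, prior_odds_h1):
    """Fraction of significant results that are false positives.

    alpha          -- significance threshold (type I error rate)
    power          -- probability of detecting a real effect (1 - beta)
    prior_odds_h1  -- prior odds that the tested effect is real (H1:H0)
    """
    p_h1 = prior_odds_h1 / (1.0 + prior_odds_h1)   # P(effect is real)
    p_h0 = 1.0 - p_h1                               # P(null is true)
    false_pos = alpha * p_h0                        # null true, test significant
    true_pos = power * p_h1                         # effect real, test significant
    return false_pos / (false_pos + true_pos)

# Illustrative assumptions: 1-in-10 prior odds of a real effect, 80% power.
for alpha in (0.05, 0.005):
    print(f"alpha={alpha}: FPR ~ {false_positive_rate(alpha, 0.8, 1/10):.2f}")
# alpha=0.05  -> roughly 0.38
# alpha=0.005 -> roughly 0.06
```

Under these assumed inputs, tightening the threshold from 0.05 to 0.005 drops the expected false positive rate among declared discoveries from roughly 38% to roughly 6%.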
Expert landmark correspondences are widely reported for evaluating deformable image registration (DIR) spatial accuracy. In this report, we present a framework for objective evaluation of DIR spatial accuracy using large sets of expert-determined landmark point pairs. Large samples (>1100) of pulmonary landmark point pairs were manually generated for five cases. Estimates of inter- and intra-observer variation were determined from repeated registration. Comparative evaluation of DIR spatial accuracy was performed for two algorithms, a gradient-based optical flow algorithm and a landmark-based moving least-squares algorithm. The uncertainty of spatial error estimates was found to be inversely proportional to the square root of the number of landmark point pairs and directly proportional to the standard deviation of the spatial errors. Using the statistical properties of this data, we performed sample size calculations to estimate the average spatial accuracy of each algorithm with 95% confidence intervals within a 0.5 mm range. For the optical flow and moving least-squares algorithms, the required sample sizes were 1050 and 36, respectively. Comparative evaluation based on fewer than the required validation landmarks results in misrepresentation of the relative spatial accuracy. This study demonstrates that landmark pairs can be used to assess DIR spatial accuracy within a narrow uncertainty range.
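The sample size reasoning above follows the standard normal-approximation relation between confidence-interval width, error standard deviation, and the number of landmark pairs. The sketch below illustrates it; the function name and the standard deviations are hypothetical stand-ins, not values reported in the study:

```python
# Sketch: number of landmark pairs needed to estimate mean registration error
# to within a 0.5 mm wide 95% confidence interval. The sigma values below
# are hypothetical, not the study's measured error distributions.
import math

def required_landmarks(sigma_mm, ci_full_width_mm=0.5, z=1.96):
    """n such that the 95% CI for the mean error spans <= ci_full_width_mm."""
    half_width = ci_full_width_mm / 2.0
    return math.ceil((z * sigma_mm / half_width) ** 2)

print(required_landmarks(sigma_mm=4.0))   # large error spread -> ~1000 pairs
print(required_landmarks(sigma_mm=0.8))   # tight error spread -> ~40 pairs
```

This captures the stated relationship: the required number of pairs grows with the square of the error standard deviation, so an algorithm with a wide error distribution needs far more validation landmarks than one with tightly clustered errors.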
Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggests that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25-50:1, and to 100-200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.

Reproducibility of scientific research is critical to the scientific endeavor, so the apparent lack of reproducibility threatens the credibility of the scientific enterprise (e.g., refs. 1 and 2). Unfortunately, concern over the nonreproducibility of scientific studies has become so pervasive that a Web site, Retraction Watch, has been established to monitor the large number of retracted papers, and methodology for detecting flawed studies has developed nearly into a scientific discipline of its own (e.g., refs. 3-9).

Nonreproducibility in scientific studies can be attributed to a number of factors, including poor research designs, flawed statistical analyses, and scientific misconduct. The focus of this article, however, is the resolution of that component of the problem that can be attributed simply to the routine use of widely accepted statistical testing procedures.

Claims of novel research findings are generally based on the outcomes of statistical hypothesis tests, which are normally conducted under one of two statistical paradigms. Most commonly, hypothesis tests are performed under the classical, or frequentist, paradigm. In this approach, a "significant" finding is declared when the value of a test statistic exceeds a specified threshold. Values of the test statistic above this threshold define the test's rejection region. The significance level α of the test is defined to be the maximum probability that the test statistic falls into the rejection region when the null hypothesis (representing standard theory) is true. By long-standing convention (10), a value of α = 0.05 defines a significant finding. The P value from a classical test is the maximum probability of observing a test statistic as extreme, or more extreme, than the value that was actually observed, given that the null hypothesis is true.

The second approach for performing hypothesis tests follows from the Bayesian paradigm and focuses on the calculation of the posterior odds that the alternative hypothesis is true, given ...
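For the simplest case of a one-sided test of a normal mean, the uniformly most powerful Bayesian test correspondence gives a closed-form link between a significance level α and the matching Bayes-factor evidence threshold, γ = exp(z_α² / 2). The sketch below (using SciPy's normal quantile) reproduces roughly the evidence bands quoted above; it illustrates only this single normal-mean case, not a general rule for all tests:

```python
# Sketch: Bayes-factor evidence threshold whose rejection region matches a
# level-alpha one-sided z-test, via the UMPBT relation gamma = exp(z_alpha^2 / 2).
# This is the normal-mean special case only.
import math
from scipy.stats import norm

def matched_evidence_threshold(alpha):
    """Bayes factor threshold matching a one-sided level-alpha z-test."""
    z_alpha = norm.ppf(1.0 - alpha)        # one-sided critical value
    return math.exp(z_alpha ** 2 / 2.0)

for alpha in (0.005, 0.001):
    print(f"alpha={alpha}: gamma ~ {matched_evidence_threshold(alpha):.0f}:1")
# alpha=0.005 -> roughly 28:1   (within the 25-50:1 band)
# alpha=0.001 -> roughly 118:1  (within the 100-200:1 band)
```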
Purpose: Adoptive cell therapy (ACT) using autologous tumor-infiltrating lymphocytes (TIL) is a promising treatment for metastatic melanoma unresponsive to conventional therapies. We report here on the results of an ongoing Phase II clinical trial testing the efficacy of ACT using TIL in metastatic melanoma patients and the association of specific patient clinical characteristics and the phenotypic attributes of the infused TIL with clinical response.
Experimental Design: Altogether, 31 transiently lymphodepleted patients were treated with their expanded TIL followed by two cycles of high-dose (HD) IL-2 therapy. The effects of patient clinical features and the phenotypes of the T-cells infused on clinical response were determined.
Results: Overall, 15/31 (48.4%) patients had an objective clinical response using immune-related response criteria (irRC), with two patients (6.5%) having a complete response. Progression-free survival of >12 months was observed for 9/15 (60%) of the responding patients. Factors significantly associated with objective tumor regression included a higher number of TIL infused, a higher proportion of CD8+ T-cells in the infusion product, a more differentiated effector phenotype of the CD8+ population and a higher frequency of CD8+ T-cells co-expressing the negative costimulation molecule “B- and T-lymphocyte attenuator” (BTLA). No significant difference in telomere lengths of TIL between responders and non-responders was identified.
Conclusion: These results indicate that immunotherapy with expanded autologous TIL is capable of achieving durable clinical responses in metastatic melanoma patients and that CD8+ T-cells in the infused TIL, particularly differentiated effector cells and cells expressing BTLA, are associated with tumor regression.
Summary. We examine philosophical problems and sampling deficiencies that are associated with current Bayesian hypothesis testing methodology, paying particular attention to objective Bayes methodology. Because the prior densities that are used to define alternative hypotheses in many Bayesian tests assign non-negligible probability to regions of the parameter space that are consistent with null hypotheses, resulting tests provide exponential accumulation of evidence in favour of true alternative hypotheses, but only sublinear accumulation of evidence in favour of true null hypotheses. Thus, it is often impossible for such tests to provide strong evidence in favour of a true null hypothesis, even when moderately large sample sizes have been obtained. We review asymptotic convergence rates of Bayes factors in testing precise null hypotheses and propose two new classes of prior densities that ameliorate the imbalance in convergence rates that is inherited by most Bayesian tests. Using members of these classes, we obtain analytic expressions for Bayes factors in linear models and derive approximations to Bayes factors in large sample settings.
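The accumulation imbalance can be seen in the simplest conjugate example: a point null for a normal mean against a N(0, τ²) prior under the alternative, where the Bayes factor based on the sample mean has a closed form. The sketch below uses illustrative parameter values (σ = τ = 1, a true alternative mean of 0.5), not settings from the paper, to show evidence for a true alternative growing exponentially in n while evidence for a true null grows only like √n (sublinearly):

```python
# Sketch: evidence accumulation imbalance for a point-null normal-mean test
# with a N(0, tau^2) prior under the alternative. Parameter values are
# illustrative only. BF01 > 1 favours the null; BF10 = 1/BF01 favours H1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, tau = 1.0, 1.0

def bf01(xbar, n):
    """Closed-form Bayes factor H0:H1 based on the sample mean."""
    null_sd = sigma / np.sqrt(n)                  # sampling sd of xbar under H0
    alt_sd = np.sqrt(tau**2 + sigma**2 / n)       # marginal sd of xbar under H1
    return norm.pdf(xbar, 0.0, null_sd) / norm.pdf(xbar, 0.0, alt_sd)

for n in (10, 100, 1000):
    xbar_h0 = rng.normal(0.0, sigma / np.sqrt(n))   # data generated under H0
    xbar_h1 = rng.normal(0.5, sigma / np.sqrt(n))   # data generated under mu = 0.5
    print(n,
          f"BF01 under a true null ~ {bf01(xbar_h0, n):.1f}",
          f"BF10 under a true alternative ~ {1 / bf01(xbar_h1, n):.1e}")
# Evidence for a true null grows only on the order of sqrt(n);
# evidence for a true alternative grows exponentially in n.
```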