Bump-hunting, or mode identification, is a fundamental problem that arises in almost every scientific field of data-driven discovery. Surprisingly, very few data modeling tools are available for automatic (not requiring manual case-by-case investigation), objective (not subjective), and nonparametric (not based on restrictive parametric model assumptions) mode discovery that can scale to large data sets. This article introduces LPMode, an algorithm based on a new theory for detecting multimodality of a probability density. We apply LPMode to answer important research questions arising in fields ranging from environmental science, ecology, econometrics, and analytical chemistry to astronomy and cancer genomics.
This paper formulates a penalized empirical likelihood (PEL) method for inference on the population mean when the dimension of the observations may grow faster than the sample size. Asymptotic distributions of the PEL ratio statistic are derived under different component-wise dependence structures of the observations, namely, (i) non-ergodic, (ii) long-range dependence and (iii) short-range dependence. It follows that the limit distribution of the proposed PEL ratio statistic can vary widely depending on the correlation structure, and it is typically different from the usual chi-squared limit of the empirical likelihood ratio statistic in the fixed and finite dimensional case. A unified subsampling-based calibration is proposed, and its validity is established in all three cases, (i)-(iii). Finite sample properties of the method are investigated through a simulation study. Comment: Published at http://dx.doi.org/10.1214/12-AOS1040 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
An overwhelmingly large body of literature and algorithms is already available on "large-scale inference problems," based on different modeling techniques and cultures. Our primary goal in this article is not to add one more methodology to the existing toolbox but instead (i) to clarify how these different simultaneous inference methods are connected, (ii) to provide an alternative, more intuitive derivation of the formulas that leads to simpler expressions, and thereby (iii) to develop a unified algorithm for practitioners. A detailed discussion of representation, estimation, inference, and model selection is given. Applications to a variety of real and simulated datasets show promise. We end with several future research directions.
Summary: High-dimensional $k$-sample comparison is a common task in applications. We construct a class of easy-to-implement distribution-free tests based on new nonparametric tools and unexplored connections with spectral graph theory. The test is shown to have various desirable properties and a characteristic exploratory flavour that has practical consequences for statistical modelling. Numerical examples show that the proposed method works surprisingly well across a broad range of realistic situations.
Consider a big data multiple testing task where, due to storage and computational bottlenecks, a very large collection of p-values is split into manageable chunks and distributed over thousands of computer nodes. This paper is concerned with the following question: how can we find the full-data multiple testing solution by operating completely independently on individual machines in parallel, without any data exchange between nodes? This version of the problem arises naturally in a wide range of data-intensive science and industry applications, yet its methodological solution has not appeared in the literature to date; we therefore feel it is necessary to undertake such an analysis. Based on the nonparametric functional statistical viewpoint of large-scale inference started in Mukhopadhyay (2016), this paper furnishes a new computing model that brings unexpected simplicity to the design of an algorithm that might otherwise seem daunting using the classical approach and notation.
Figure 1: The data structure and setting of the decentralized large-scale inference problem: a massive collection of p-values distributed across a large number of computer nodes (Machine 1, Machine 2, ..., Machine K).
... be unrealistic due to huge volume (too expensive to store), computational bottlenecks†, and possible privacy restrictions. Driven by practical need, interest in designing a Decentralized Large-Scale Inference Engine has increased enormously in the last few years, owing to its ability to scale cost-effectively as data volume continues to grow by leveraging modern distributed storage and computing environments. There is, however, apparently no explicit algorithm currently available in the literature to tackle this innocent-looking problem of breaking the multiple testing computation into many pieces, each of which can be processed completely independently on individual machines in parallel.
Remark 1. To get a glimpse of the challenge, consider a specific multiple testing method, say the Benjamini-Hochberg (BH) FDR-controlling procedure, which starts by calculating the global rank of each p-value. Computing these global ranks from the partitioned p-values, without any communication between the machines, is a highly non-trivial problem. Difficulties of similar caliber also arise in implementing local false discovery type algorithms.
† The BH (Benjamini and Hochberg, 1995) and HC (Donoho and Jin, 2004) procedures start by ordering the p-values from smallest to largest, incurring at least $O(N \log N)$ computational cost, and other methods such as local fdr (Efron et al., 2001) are of even greater complexity, $O(N^2)$, thereby making legacy multiple testing algorithms infeasible for such massive-scale inference problems.
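The difficulty described in Remark 1 can be illustrated with a small sketch. The Python code below implements the textbook BH step-up rule (reject every p-value whose global rank is at most the largest $k$ with $p_{(k)} \le \alpha k / N$); it is not the decentralized algorithm proposed in the paper. The function names `global_ranks` and `benjamini_hochberg`, the toy p-values, and the two-chunk split are all illustrative assumptions. The example only shows that a p-value's rank inside its own chunk can differ from its global rank, so running BH separately on each machine need not reproduce the full-data answer.

```python
def global_ranks(pvalues):
    """Rank of each p-value among all values (1 = smallest).

    Ties are broken by position. The sort is the O(N log N) step
    mentioned in the footnote.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    ranks = [0] * len(pvalues)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices rejected by the standard BH step-up rule at level alpha."""
    n = len(pvalues)
    ranks = global_ranks(pvalues)
    # Largest rank k whose p-value satisfies p_(k) <= alpha * k / n.
    k_max = max(
        (r for p, r in zip(pvalues, ranks) if p <= alpha * r / n),
        default=0,
    )
    return sorted(i for i, r in enumerate(ranks) if r <= k_max)


# Hypothetical toy data: eight p-values split across two "machines".
pvals = [0.001, 0.008, 0.012, 0.02, 0.03, 0.60, 0.75, 0.90]
chunks = [pvals[:4], pvals[4:]]

# BH on the full data, which needs every p-value in one place.
full = benjamini_hochberg(pvals)

# Naive alternative (an assumption for illustration): run BH on each
# chunk independently and pool the rejections.
naive = []
offset = 0
for chunk in chunks:
    naive.extend(offset + i for i in benjamini_hochberg(chunk))
    offset += len(chunk)

print("global rank of 0.03:", global_ranks(pvals)[4])
print("rank of 0.03 within its chunk:", global_ranks(chunks[1])[0])
print("full-data rejections:", full)
print("per-chunk rejections:", naive)
```

In this toy split, the p-value 0.03 is rejected by the full-data procedure but not by the per-chunk run, because its local rank understates its global rank. Pooling per-chunk results is used here only as a strawman; any scheme that recovers the full-data solution without communication has to get around this mismatch.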