Let W be a random variable with mean zero and variance σ².
Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D2 statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D2 statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D2 word count statistic, which we call D2S and D2∗. For D2S, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D2∗, outperforms D2S in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D2∗, we cannot provide a closed form for power calculations.
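As a minimal sketch of the k-tuple comparison the abstract refers to, the plain D2 statistic can be computed as the sum, over all words of length k, of the product of that word's counts in the two sequences. The function names `kmer_counts` and `d2` are illustrative choices, not taken from the paper, and the sketch covers only the unadjusted D2, not the D2S or D2∗ variants.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-tuple (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Plain D2: sum over words w of (count of w in seq_a) * (count of w in seq_b)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Words absent from seq_b contribute zero (Counter returns 0 for missing keys).
    return sum(count * counts_b[word] for word, count in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGTTTGA", 2))
```

Because the counts are not centered or standardized here, a large value can arise from the composition of either sequence alone, which is the single-sequence noise the abstract describes.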
We compute explicit bounds in the normal and chi-square approximations of multilinear homogeneous sums (of arbitrary order) of general centered independent random variables with unit variance. In particular, we show that chaotic random variables enjoy the following form of universality: (a) the normal and chi-square approximations of any homogeneous sum can be completely characterized and assessed by first switching to its Wiener chaos counterpart, and (b) the simple upper bounds and convergence criteria available on the Wiener chaos extend almost verbatim to the class of homogeneous sums.
I. Nourdin, G. Peccati and G. Reinert
Our findings partially rely on the notion of "low influences" (see again [10]) for real-valued functions defined on product spaces. As indicated by the title, we regard the two properties (a) and (b) as an instance of the universality phenomenon, according to which most information about large random systems (such as the "distance to Gaussian" of nonlinear functionals of large samples of independent random variables) does not depend on the particular distribution of the components. Other recent examples of the universality phenomenon appear in the already quoted paper [10], as well as in the Tao-Vu proof of the circular law for random matrices, as detailed in [31] (see also the Appendix to [31] by Krishnapur). Observe that, in Section 7, we will prove analogous results for the multivariate normal approximation of vectors of homogeneous sums of possibly different orders. In a further work by the first two authors (see [14]) the results of the present paper are applied in order to deduce universal Gaussian fluctuations for traces associated with non-Hermitian matrix ensembles.
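To make the term concrete, a multilinear homogeneous sum of order d can be sketched as a sum, over tuples of d distinct indices, of a kernel value times the product of the corresponding variables. The code below is only an illustration under that assumption: the name `homogeneous_sum`, the callable `kernel`, and the convention of skipping tuples with repeated indices (a kernel vanishing on diagonals) are choices made here, not notation from the paper.

```python
from itertools import product
from math import prod
from typing import Callable, Sequence, Tuple


def homogeneous_sum(
    kernel: Callable[[Tuple[int, ...]], float],
    xs: Sequence[float],
    order: int,
) -> float:
    """Sum of kernel(i_1..i_d) * x_{i_1} * ... * x_{i_d} over distinct indices."""
    total = 0.0
    for idx in product(range(len(xs)), repeat=order):
        if len(set(idx)) < order:
            # Assumption: the kernel vanishes on diagonals, so tuples with
            # a repeated index contribute nothing.
            continue
        total += kernel(idx) * prod(xs[i] for i in idx)
    return total


if __name__ == "__main__":
    # Order 2 with a constant kernel: sum of x_i * x_j over i != j.
    print(homogeneous_sum(lambda idx: 1.0, [1.0, 2.0, 3.0], 2))
```

Each variable appears with power at most one in every term, which is what makes the sum multilinear. Universality in the sense of the abstract concerns replacing the entries of `xs` by samples from a different centered, unit-variance distribution.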
In this paper we establish a multivariate exchangeable pairs approach within the framework of Stein's method to assess distributional distances to potentially singular multivariate normal distributions. By extending the statistics into a higher-dimensional space, we also propose an embedding method which allows for a normal approximation even when the corresponding statistics of interest do not lend themselves easily to Stein's exchangeable pairs approach. To illustrate the method, we provide the examples of runs on the line as well as double-indexed permutation statistics.
Heuristically, (1.1) can be understood as a linear regression condition. If (W, W′) were bivariate normal with correlation ρ, then
In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein's method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests.
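The three kinds of counts distinguished above can be illustrated with a short sketch. It assumes the usual conventions: an occurrence is any (possibly overlapping) match, a clump is a maximal group of matches in which each one overlaps the next, and renewals are the non-overlapping matches kept when scanning left to right. The function names are hypothetical and chosen only for this example.

```python
from typing import List


def occurrences(seq: str, word: str) -> List[int]:
    """Start positions of all occurrences of word, overlaps included."""
    k = len(word)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == word]


def count_clumps(seq: str, word: str) -> int:
    """Number of maximal groups of successively overlapping occurrences."""
    k = len(word)
    clumps = 0
    last = None
    for pos in occurrences(seq, word):
        # A new clump starts when this match does not overlap the previous one.
        if last is None or pos - last >= k:
            clumps += 1
        last = pos
    return clumps


def count_renewals(seq: str, word: str) -> int:
    """Number of non-overlapping occurrences found by a left-to-right scan."""
    k = len(word)
    renewals = 0
    next_free = 0
    for pos in occurrences(seq, word):
        if pos >= next_free:
            renewals += 1
            next_free = pos + k  # skip matches overlapping this one
    return renewals


if __name__ == "__main__":
    seq, word = "ATATATGATAT", "ATAT"
    print(len(occurrences(seq, word)), count_clumps(seq, word), count_renewals(seq, word))
```

For a self-overlapping word such as `ATAT` the three counts generally differ, which is the dependence structure the overview sets out to disentangle. For a word with no self-overlap they coincide.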