Abstract. We consider the problem of efficiently estimating multivariate densities and their modes for moderate dimensions and an abundance of data. We propose polynomial histograms to solve this estimation problem. We present first- and second-order polynomial histogram estimators for a general d-dimensional setting. Our theoretical results include pointwise bias and variance of these estimators, their asymptotic mean integrated square error (AMISE), and optimal binwidth. The asymptotic performance of the first-order estimator matches that of the kernel density estimator, while the second-order estimator has the faster rate of O(n^(-6/(d+6))). For a bivariate normal setting, we present explicit expressions for the AMISE constants, which show the much larger binwidths of the second-order estimator and hence also more efficient computations of multivariate densities. We apply polynomial histogram estimators to real data from biotechnology and find the number and location of modes in such data.
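To make the idea of a polynomial histogram concrete, here is a minimal one-dimensional sketch of a first-order estimator. It assumes a moment-matching construction: in each bin a line is chosen so that its integral equals the bin's relative frequency and its first moment equals the bin's empirical first moment. The abstract does not spell out the construction, so the function name `first_order_histogram`, the 1D restriction, and the moment-matching rule are illustrative assumptions, not the authors' code.

```python
def first_order_histogram(data, lo, hi, n_bins):
    """Return f(x), a piecewise-linear density estimate on [lo, hi).

    Assumed construction (moment matching): on a bin with centre c and
    width h the estimate is a0 + a1 * (x - c), where
      a0 * h         = fraction of all data falling in the bin
      a1 * h**3 / 12 = (1/n) * sum of (x_i - c) over the bin
    """
    n = len(data)
    h = (hi - lo) / n_bins
    counts = [0] * n_bins
    sums = [0.0] * n_bins  # per-bin sum of (x - centre)
    for x in data:
        if lo <= x < hi:
            j = min(int((x - lo) / h), n_bins - 1)
            centre = lo + (j + 0.5) * h
            counts[j] += 1
            sums[j] += x - centre

    # One (a0, a1) pair per bin.
    coeffs = [(counts[j] / (n * h), 12.0 * sums[j] / (n * h ** 3))
              for j in range(n_bins)]

    def f(x):
        if not (lo <= x < hi):
            return 0.0
        j = min(int((x - lo) / h), n_bins - 1)
        centre = lo + (j + 0.5) * h
        a0, a1 = coeffs[j]
        return a0 + a1 * (x - centre)

    return f
```

Setting `a1` to zero recovers the ordinary histogram. A linear piece can dip below zero near a bin edge when the data in the bin are strongly skewed; this sketch does not correct for that.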
Many methods have been described for automated clustering analysis of complex flow cytometry data, but so far the goal to efficiently estimate multivariate densities and their modes for a moderate number of dimensions and potentially millions of data points has not been attained. We have devised a novel approach to describing modes using second order polynomial histogram estimators (SOPHE). The method divides the data into multivariate bins and determines the shape of the data in each bin based on second order polynomials, which is an efficient computation. These calculations yield local maxima and allow joining of adjacent bins to identify clusters. The use of second order polynomials also optimally uses wide bins, such that in most cases each parameter (dimension) need only be divided into 4-8 bins, again reducing computational load. We have validated this method using defined mixtures of up to 17 fluorescent beads in 16 dimensions, correctly identifying all populations in data files of 100,000 beads in <10 s, on a standard laptop. The method also correctly clustered granulocytes, lymphocytes, including standard T, B, and NK cell subsets, and monocytes in 9-color stained peripheral blood, within seconds. SOPHE successfully clustered up to 36 subsets of memory CD4 T cells using differentiation and trafficking markers, in 14-color flow analysis, and up to 65 subpopulations of PBMC in 33-dimensional CyTOF data, showing its usefulness in discovery research. SOPHE has the potential to greatly increase efficiency of analysing complex mixtures of cells in higher dimensions. © 2015 International Society for Advancement of Cytometry
Key terms: data analysis; clustering; high dimensions; complex data
Lymphocytes were originally viewed by light microscopy as small homogeneous round cells with minimal cytoplasm (1).
Multiparameter flow cytometry, using currently available lasers, monoclonal antibodies and fluorochromes (2), has helped reveal the extremely complex heterogeneity of differentiated T and B cells with diverse immunological properties (3,4). It is possible to analyze 18 or more markers simultaneously on individual lymphocytes (2), and the development of additional labels will greatly increase this number. Study of 40 or more markers has already been achieved by the use of transition element isotopes as chelated antibody tags and mass cytometry, instead of fluorochromes (5,6). Addition of more parameters is an important goal of flow cytometric analysis of lymphocytes, because, paradoxically, instead of simply increasing complexity of the results, they can actually reveal important subsets with much greater clarity. This was originally best shown by the combination of 2 light scatter and 2 fluorescence parameters, CD45 and CD14, to clearly separate lymphocytes in blood from monocytes, granulocytes and red cell debris (7). In our experience, an important example is to first separate naïve and memory T cells using CD45RA/CD45RO to measure expres-
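The per-bin step of the SOPHE method summarized in the abstract above (fit a second order polynomial in each bin, then read off local maxima) can be sketched in one dimension. This is a simplified illustration under stated assumptions: the coefficients are obtained by matching the zeroth, first and second empirical moments of the bin, the data are univariate rather than multivariate, and the helper names `second_order_bin_fit` and `bin_mode` are hypothetical. Joining adjacent bins into clusters is not shown.

```python
def second_order_bin_fit(points, centre, h, n_total):
    """Fit a0 + a1*u + a2*u**2 (u = x - centre) on one bin of width h.

    Assumed rule: the integrals of 1, u and u**2 against the quadratic over
    [-h/2, h/2] equal the empirical moments m0, m1, m2 of the bin, each
    normalised by the total sample size n_total.
    """
    u = [x - centre for x in points]
    m0 = len(u) / n_total
    m1 = sum(u) / n_total
    m2 = sum(v * v for v in u) / n_total
    a2 = 180.0 * (m2 - m0 * h ** 2 / 12.0) / h ** 5
    a1 = 12.0 * m1 / h ** 3
    a0 = m0 / h - a2 * h ** 2 / 12.0
    return a0, a1, a2


def bin_mode(a1, a2, centre, h):
    """Return the location of a local maximum inside the bin, or None."""
    if a2 >= 0:  # quadratic is flat or opens upwards: no interior maximum
        return None
    u_star = -a1 / (2.0 * a2)  # stationary point of the quadratic
    if abs(u_star) <= h / 2.0:
        return centre + u_star
    return None


# Example: four points concentrated right of the centre of the bin [0, 1].
a0, a1, a2 = second_order_bin_fit([0.55, 0.6, 0.6, 0.65], 0.5, 1.0, 4)
mode = bin_mode(a1, a2, 0.5, 1.0)
```

Because each bin needs only its count and its first two moments, the fit costs one pass over the data, which is in line with the abstract's remark that the per-bin computation is efficient.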
The D2 statistic, defined as the number of matches of words of some pre-specified length k, is a computationally fast alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose, as the variability in D2 may be dominated by the terms that reflect the noise in each of the single sequences only. We examine the extent of the problem and the effectiveness of overcoming it by using two mean-centred variants of this statistic, D2* and D2c. We conclude that all three statistics are potentially useful measures of sequence similarity, for which reasonably accurate p-values can be estimated under a null hypothesis of sequences composed of identically and independently distributed letters. We show that D2 and D2c, and to a somewhat lesser extent D2*, perform well in tests to classify moderate length query sequences as putative cis-regulatory modules.
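As a small illustration of the definition above, the following sketch computes the plain D2 statistic by counting every pair of positions, one in each sequence, at which the same k-letter word starts. The function names are illustrative, and the mean-centred variants are not shown because the abstract does not give their formulas.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between two sequences.

    Each word contributes (occurrences in seq_a) * (occurrences in seq_b),
    i.e. one match per pair of start positions carrying the same word.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[word] for word, n in counts_a.items())


print(d2("ACGT", "ACGA", 2))  # shared words AC and CG, once each: prints 2
print(d2("AAAA", "AAA", 2))   # AA occurs 3 times and 2 times: prints 6
```

Word counting takes time linear in the sequence lengths, which is what makes the statistic fast compared with alignment.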