In this paper, we develop a local rank correlation measure that quantifies the performance of dimension reduction methods. The local rank correlation is easily interpretable and robust against the extreme skewness of nearest-neighbor distributions in high dimensions. We study several benchmark datasets and find that the local rank correlation closely corresponds to our visual interpretation of the quality of the output. In addition, we demonstrate that the local rank correlation is useful for estimating the intrinsic dimensionality of the original data and for selecting suitable values of the tuning parameters used in some algorithms.

The local rank correlations $\rho_J^O(i)$ and $\tau_J^O(i)$ measure the similarity, in terms of output errors, between the corresponding neighborhoods $N_J^I(i)$ and $N_J^O(i)$. Similarly, we can define local rank correlations that measure the input error.

Definition 2 (Local rank correlation for input error). Given an input dataset $X$ and a low-dimensional representation $Y$, the local Spearman correlation $\rho_J^I(i)$ and the local Kendall correlation $\tau_J^I(i)$ are defined analogously, with the input neighborhood $N_J^I(i)$ taking the role that the output neighborhood plays in the output-error measures.

The overall goodness measure of a given method $\psi$ and input data $X$ is defined as
$$G_J^I(\psi, X) = \frac{1}{n}\sum_{i=1}^{n} \Gamma_J^I(i),$$
where $\Gamma_J^I$ can be either $\rho_J^I$ or $\tau_J^I$.

Remark. The proposed local rank correlations have some nice properties. Higher values of the local measures $\Gamma_J^I(i)$ and $\Gamma_J^O(i)$ indicate a higher degree of similarity between the original data and the low-dimensional configuration in the neighborhood of case $i$, while values close to 0, or negative values, indicate that the low-dimensional configuration fails to preserve the local structure of the input data in certain neighborhoods. Two special situations are:

• $\Gamma_J^I(i) = \Gamma_J^O(i) = 1$ if all the ranking relationships of the observed data $X$ in the neighborhood of case $i$ are preserved exactly in the corresponding neighborhood in the output data $Y$.
• The expected values $E[\Gamma_J^I(i)]$ and $E[\Gamma_J^O(i)]$ are both zero, for any case $i$, when the output $Y$ is generated by an algorithm that is stochastically independent of the input data $X$.

These two facts hold for both the local Spearman and Kendall correlations. Notice that the second situation is worse than anything a sensible algorithm produces in practice. Moreover, the local measures $\Gamma_J^I(i)$ and $\Gamma_J^O(i)$ can take negative values for some $i$. Nevertheless, for sensible algorithms, the overall goodness measures $G_J^O$ and $G_J^I$ will take values between 0 and 1. We remind the reader that ranks are used to protect against non-normality and the extreme skewness of distance distributions in high dimensions.

The computational complexity is also of interest. To calculate the goodness measure, we first construct the $J$-nearest-neighbor graph for both $X$ and $Y$; this step scales as $O(n^2 p)$. We then calculate the local rank correlation in each neighborhood, which scales (per neighborhood) as $O(J)$ for the Spearman $\rho_J$ and $O(J \log J)$ for the Kendall $\tau_J$. Therefore, since $J \le n$, the total complexity of calculating $G_J^I$ (or $G_J^O$) scales as $O(n^2 p)$ for $\rho_J$ and as $O(n^2 p + nJ \log J)$ for $\tau_J$. To use the proposed goodness measure $G_J$ for a...
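To make this concrete, here is a minimal Python sketch of one plausible reading of the measure: for each case $i$ it rank-correlates the input-space and output-space distances to the $J$ input-space nearest neighbors of $i$, and the overall measure averages these local Spearman correlations over all cases. The function name `goodness_G_J` and the distance-based formulation are illustrative assumptions, not the paper's exact definitions.

```python
# A sketch of the overall goodness measure G_J, assuming it is the average of
# local Spearman correlations over all cases (an illustrative reading of this
# excerpt, not the paper's verbatim formulas).
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import spearmanr

def goodness_G_J(X, Y, J):
    """Average local Spearman correlation between the J-nearest-neighbor
    structures of the input X (n x p) and the output Y (n x q); J < n."""
    n = X.shape[0]
    DX = cdist(X, X)  # pairwise input distances, O(n^2 p)
    DY = cdist(Y, Y)  # pairwise output distances, O(n^2 q)
    local = np.empty(n)
    for i in range(n):
        # J nearest neighbors of case i in the input space, N_J^I(i)
        nbrs = np.argsort(DX[i])[1:J + 1]  # position 0 is i itself
        # rank-correlate the input and output distances to those neighbors
        local[i], _ = spearmanr(DX[i, nbrs], DY[i, nbrs])
    return local.mean(), local  # overall G_J and the per-case values
```

Under this sketch, the two applications claimed in the abstract follow directly: embed the data for a range of target dimensions (or tuning-parameter values), compute $G_J$ for each, and take the dimension at which the curve levels off (or the parameter value that maximizes $G_J$).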
Information in the data often has far fewer degrees of freedom than the number of variables encoding the data. Dimensionality reduction attempts to reduce the number of variables used to describe the data. In this article, we survey some dimension reduction techniques that are robust. We consider linear dimension reduction first and describe robust principal component analysis (PCA) using three approaches. The first approach uses a singular value decomposition of a robust covariance matrix. The second approach employs robust measures of dispersion to realize PCA as a robust projection pursuit. The third approach uses a low-rank plus sparse decomposition of the data matrix. We also survey robust approaches to nonlinear dimension reduction under a unifying framework of kernel PCA. By using the kernel trick, the robust methods available for PCA can be extended to nonlinear cases. WIREs Comput Stat 2015, 7:63–69. doi: 10.1002/wics.1331
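As a concrete illustration of the first approach, the following sketch substitutes a robust covariance estimate for the classical one before extracting principal directions. The minimum covariance determinant (MCD) estimator used here is one common robust scatter estimator; the survey does not prescribe a specific choice, so treat MCD and the helper `robust_pca` as illustrative assumptions.

```python
# A minimal sketch of robust PCA via the spectral decomposition of a robust
# covariance matrix (MCD is an assumed, illustrative choice of estimator).
import numpy as np
from sklearn.covariance import MinCovDet

def robust_pca(X, k):
    """Project X (n x p) onto the top-k eigenvectors of a robust covariance."""
    mcd = MinCovDet().fit(X)  # robust location and scatter estimates
    evals, evecs = np.linalg.eigh(mcd.covariance_)  # ascending eigenvalues
    components = evecs[:, np.argsort(evals)[::-1][:k]]  # top-k directions
    return (X - mcd.location_) @ components  # robust principal component scores
```

Because MCD fits the scatter matrix on the most concentrated subset of the data, outlying rows influence the fitted directions far less than they would in classical PCA.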
Class-imbalanced datasets are common in real-world applications, ranging from credit card fraud detection to rare disease diagnostics. Several popular classification algorithms assume that classes are approximately balanced and hence build the accompanying objective function to maximize an overall accuracy rate. In these situations, optimizing overall accuracy leads to predictions highly skewed towards the majority class. Moreover, the negative business impact resulting from false negatives (positive samples incorrectly classified as negative) can be detrimental. Many methods have been proposed to address the class imbalance problem, including oversampling, undersampling, and cost-sensitive methods. In this paper, we consider oversampling, where the aim is to augment the original dataset with synthetically created observations of the minority classes. In particular, inspired by recent advances in generative modelling techniques (e.g., variational inference and generative adversarial networks), we introduce a new oversampling technique based on variational autoencoders. Our experiments show that the new method is superior to traditional oversampling methods in augmenting datasets for downstream classification tasks.
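To sketch the idea, the snippet below fits a small variational autoencoder to the minority-class rows and then decodes draws from the standard normal prior to produce synthetic minority observations. The architecture, latent dimension, mean-squared reconstruction loss, and training schedule are illustrative assumptions; the paper's exact configuration is not given in this abstract.

```python
# A minimal PyTorch sketch of VAE-based oversampling for tabular data
# (illustrative architecture and hyperparameters, not the paper's).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, p, h=64, z=8):
        super().__init__()
        self.z = z
        self.enc = nn.Sequential(nn.Linear(p, h), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h, z), nn.Linear(h, z)
        self.dec = nn.Sequential(nn.Linear(z, h), nn.ReLU(), nn.Linear(h, p))

    def forward(self, x):
        hid = self.enc(x)
        mu, logvar = self.mu(hid), self.logvar(hid)
        std = torch.exp(0.5 * logvar)
        latent = mu + std * torch.randn_like(std)  # reparameterization trick
        return self.dec(latent), mu, logvar

def oversample(X_min, n_new, epochs=200, lr=1e-3):
    """Fit a VAE on minority rows X_min (n x p); return n_new synthetic rows."""
    model = VAE(X_min.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        recon, mu, logvar = model(X_min)
        # negative ELBO: reconstruction error plus KL to the N(0, I) prior
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        loss = F.mse_loss(recon, X_min, reduction="sum") + kl
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model.dec(torch.randn(n_new, model.z))  # decode prior samples
```

The returned rows would then be appended to the original dataset, labelled as minority cases, before training the downstream classifier.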