Dimension reduction methods are commonly applied to high-throughput biological datasets. However, the results can be hindered by confounding factors, either biological or technical in origin. In this study, we extend principal component analysis (PCA) to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We show that AC-PCA can adjust for (i) variations across individual donors present in a human brain exon array dataset and (ii) variations across species in a model organism ENCODE RNA sequencing dataset. Our approach is able to recover the anatomical structure of neocortical regions and to capture the shared variation among species during embryonic development. For gene selection purposes, we extend AC-PCA with sparsity constraints and propose and implement an efficient algorithm. The methods developed in this paper can also be applied to more general settings. The R package and MATLAB source code are available at https://github.com/linzx06/AC-PCA.

dimension reduction | confounding variation | transcriptome

Dimension reduction methods, such as multidimensional scaling (MDS) and principal component analysis (PCA), are commonly applied to high-throughput biological datasets to visualize data in a low-dimensional space, identify dominant patterns, and extract relevant features (1-6). MDS aims to place each sample in a lower-dimensional space such that the between-sample distances are preserved as much as possible (7). PCA seeks the linear combinations of the original variables such that the derived variables capture maximal variance (8). One advantage of PCA is that the principal components (PCs) are more interpretable through the loadings of the variables.

Confounding factors, either biological or technical in origin, are commonly observed in high-throughput biological experiments. Various methods have been proposed to estimate the confounding variation, for example, regression models on known confounding factors (9), and factor models and surrogate variable analysis for unobserved confounding factors (10-15). However, limited work has been done in the context of dimension reduction. Confounding variation can affect PC-based visualization of the data points because it may obscure the desired biological variation, and it can also affect the loadings of the variables in the PCs.

Here we extend PCA to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We introduce a class of penalty functions in PCA that encourages the PCs to be invariant to the confounding variation. We demonstrate the performance of AC-PCA through its application to a human brain development exon array dataset (4), a model organism ENCODE (modENCODE) RNA sequencing (RNA-Seq) dataset (16, 17), and simulated data. We also implemented AC-PCA with sparsity constraints to enable variable/gene selection and better interpretation of the PCs.

Results

AC-PCA in a General Form. Let X denote the N × p data matrix, where N is the number of observations and p is the number of variables.
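To make the idea concrete, here is a minimal numpy sketch of a penalized PCA of this kind. It assumes the criterion max over unit-norm v of vᵀXᵀXv − λ vᵀXᵀKXv with a linear confounder kernel K = YYᵀ, where Y is a confounder design (e.g., donor indicators); the function name ac_pca and the value of λ are ours for illustration, and the released R/MATLAB packages are the reference implementation.

```python
import numpy as np

def ac_pca(X, Y, lam, n_components=2):
    """Sketch of confounding-adjusted PCA.

    X : (N, p) centered data matrix
    Y : (N, q) confounder design, e.g. one-hot donor indicators
    lam : penalty strength; lam = 0 recovers ordinary PCA
    """
    N = X.shape[0]
    K = Y @ Y.T                              # linear kernel of the confounder
    M = X.T @ (np.eye(N) - lam * K) @ X      # penalized scatter matrix, p x p
    M = (M + M.T) / 2                        # symmetrize against round-off
    w, V = np.linalg.eigh(M)                 # eigenvalues in ascending order
    V = V[:, ::-1][:, :n_components]         # loadings of the top components
    return X @ V, V                          # scores and loadings
```

With lam = 0 this is plain PCA; increasing lam penalizes components whose scores vary with the confounder, pushing the low-dimensional representation toward invariance across confounder levels.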
We prove a sharp Bernstein inequality for Markov chains on general state spaces that are not necessarily reversible. It is sharp in the sense that the variance proxy term is optimal. Our result covers the classical Bernstein inequality for independent random variables as a special case.
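For reference, the independent special case that such a result must recover is the classical Bernstein inequality: for independent mean-zero random variables X_1, ..., X_n with |X_i| ≤ M, one standard statement is

```latex
\Pr\!\left( \left| \sum_{i=1}^{n} X_i \right| \ge t \right)
  \le 2 \exp\!\left( - \frac{t^2}{2\left( \sum_{i=1}^{n} \mathbb{E}[X_i^2] + Mt/3 \right)} \right),
```

where the sum of second moments plays the role of the variance proxy whose optimality the Markov-chain version is asserting.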
In this paper, we propose hard thresholding regression (HTR) for estimating high-dimensional sparse linear regression models. HTR uses a two-stage convex algorithm to approximate the ℓ0-penalized regression: the first stage computes a coarse initial estimator, and the second stage identifies the oracle estimator by borrowing information from the first. Theoretically, the HTR estimator achieves the strong oracle property over a wide range of regularization parameters. Numerical examples and a real-data application lend further support to the proposed methodology.
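The abstract does not spell out the two stages, but procedures of this shape typically pair a convex first-stage estimator with a threshold-and-refit second stage. The sketch below is a generic illustration of that pattern, not the authors' exact algorithm; the Lasso first stage and the tuning values alpha and tau are hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def two_stage_htr_sketch(X, y, alpha=0.1, tau=0.05):
    """Generic two-stage hard-thresholding sketch.

    alpha : penalty for the coarse convex first stage (assumption)
    tau   : hard-threshold level for support selection (assumption)
    """
    beta_init = Lasso(alpha=alpha).fit(X, y).coef_   # stage 1: coarse estimate
    support = np.abs(beta_init) > tau                # stage 2: keep large coefficients
    beta = np.zeros(X.shape[1])
    if support.any():
        # unpenalized least squares restricted to the selected support
        beta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return beta
```

The refit step is what the "oracle estimator" language points at: when the selected support matches the true one, the refit coincides with least squares on the true model.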
High-dimensional linear regression has been intensively studied by the statistics community over the last two decades. For the convenience of theoretical analysis, classical methods usually assume independent observations and sub-Gaussian-tailed errors. However, neither assumption holds in many real high-dimensional time-series data. Recently, [Sun, Zhou and Fan, 2019, J. Amer. Stat. Assoc., in press] proposed adaptive Huber regression (AHR) to address the issue of heavy-tailed errors. They discovered that the robustification parameter of the Huber loss should adapt to the sample size, the dimensionality, and the moments of the heavy-tailed errors. We make progress in a complementary direction and justify AHR for dependent observations. Specifically, we consider an important dependence structure: Markov dependence. Our results show that Markov dependence affects both the adaptation of the robustification parameter and the estimation of the regression coefficients, in that the sample size should be discounted by a factor depending on the spectral gap of the underlying Markov chain.
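To make the discount concrete, here is an illustrative sketch: the robustification parameter follows the adaptive scaling τ ≍ √(n_eff / log d), and Markov dependence enters by shrinking n to an effective sample size. The gap-based factor n_eff = gap · n below is a placeholder assumption for illustration, not the paper's exact discount.

```python
import numpy as np
from scipy.optimize import minimize

def adaptive_huber_fit(X, y, spectral_gap=1.0, c=1.0):
    """Sketch of adaptive Huber regression under Markov dependence.

    spectral_gap : gap of the underlying chain in (0, 1]; the linear
                   discount n_eff = spectral_gap * n is an assumption
    c            : scaling constant for tau (assumption)
    """
    n, d = X.shape
    n_eff = spectral_gap * n                      # discounted sample size
    tau = c * np.sqrt(n_eff / np.log(max(d, 2)))  # adaptive robustification

    def huber_loss(beta):
        r = y - X @ beta
        quad = 0.5 * r**2                         # quadratic part, |r| <= tau
        lin = tau * np.abs(r) - 0.5 * tau**2      # linear part, |r| > tau
        return np.where(np.abs(r) <= tau, quad, lin).sum()

    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS warm start
    return minimize(huber_loss, beta0, method="BFGS").x
```

A smaller spectral gap (slower mixing) yields a smaller n_eff and hence a smaller τ, i.e., more aggressive truncation of large residuals, which matches the abstract's message that dependence should make the estimator more conservative.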