2011
DOI: 10.1137/100804139

An Algorithm for the Principal Component Analysis of Large Data Sets

Abstract: Recently popularized randomized methods for principal component analysis (PCA) efficiently and reliably produce nearly optimal accuracy, even on parallel processors, unlike the classical (deterministic) alternatives. We adapt one of these randomized methods for use with data sets that are too large to be stored in random-access memory (RAM). (The traditional terminology is that our procedure works efficiently out-of-core.) We illustrate the performance of the algorithm via several numerical examples. For examp…
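
As a rough illustration of the family of methods the abstract refers to, here is a minimal in-memory sketch of randomized PCA in NumPy. This is not the paper's out-of-core algorithm; the function name and parameters are illustrative only:

import numpy as np

def randomized_pca(A, k, oversample=10, n_iter=2, seed=None):
    """Approximate top-k principal components of A (n_samples x n_features).

    A minimal in-RAM sketch of the randomized method, not the paper's
    out-of-core implementation.
    """
    rng = np.random.default_rng(seed)
    A = A - A.mean(axis=0)                 # center columns so SVD gives the PCs
    n, m = A.shape
    l = k + oversample                     # a few extra columns for safety
    Y = A @ rng.standard_normal((m, l))    # sample the range of A
    Q, _ = np.linalg.qr(Y)
    for _ in range(n_iter):                # power iterations sharpen the spectrum
        Q, _ = np.linalg.qr(A.T @ Q)       # re-orthonormalize each half-step
        Q, _ = np.linalg.qr(A @ Q)         # for numerical stability
    B = Q.T @ A                            # small l x m matrix
    Uhat, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Uhat)[:, :k], s[:k], Vt[:k]

# Example: a tall random matrix with planted rank structure.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50)) @ rng.standard_normal((50, 300))
U, s, Vt = randomized_pca(X, k=10)
print(s[:5])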

Cited by 246 publications (228 citation statements)
References 11 publications
“…That is, whereas the previous bounds held for any input, here the proper choice for the oversampling factor p can be quite sensitive to the input matrix. For example, when parameterized this way, p could depend on the size of the matrix dimensions, the decay properties of the spectrum, and the particular choice made for the random projection matrix [160,161,82,162,14,163]. Moreover, for worst-case input matrices, such a procedure may fail.…”
Section: An Improved Random Projection Algorithm
confidence: 99%
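
The sensitivity to the oversampling factor p that this excerpt describes can be seen in a toy experiment. This is a hedged sketch: the matrix, its spectrum, and the parameter choices are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 10
# Symmetric test matrix with a slowly decaying spectrum, the regime
# in which the choice of p matters most.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 1.0 / np.arange(1, n + 1) ** 0.5
A = (U * s) @ U.T

for p in (0, 5, 20):
    Omega = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(A @ Omega)       # basis from the sampled range
    err = np.linalg.norm(A - Q @ (Q.T @ A), 2)
    print(f"p = {p:2d}: ||A - QQ^T A||_2 = {err:.3e}")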
“…This has the effect of speeding up the decay of the spectrum while leaving the singular vectors unchanged, and it is observed in [162,14] that q = 2 or q = 4 is often sufficient for certain data matrices of interest. This algorithm was analyzed in greater detail for the case of Gaussian random matrices in [14], and an out-of-core implementation (meaning, appropriate for data sets that are too large to be stored in RAM) of it was presented in [163].…”
confidence: 99%
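
A quick numerical check of the claim in that excerpt: the singular values of (A Aᵀ)^q A are σᵢ^(2q+1), so the spectrum decays much faster while the singular vectors are unchanged. Illustrative NumPy, not code from the cited works:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 300))
s = np.linalg.svd(A, compute_uv=False)

for q in (0, 2, 4):
    # Singular values of (A A^T)^q A are sigma_i ** (2q + 1), so the
    # relative gap between leading and trailing values widens rapidly.
    sq = s ** (2 * q + 1)
    print(f"q = {q}: sigma_20 / sigma_1 = {sq[19] / sq[0]:.2e}")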
“…The multivariate analysis was based on the Euclidean distance matrix calculated from log-transformed protein spot intensity data, with pair-wise tests conducted between groups. PCA is one of the most commonly used techniques for an in-depth search of predominant proteomic patterns in a large dataset (Halko et al, 2011).…”
Section: Proteome Analysis
confidence: 99%
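
As a schematic of the workflow that excerpt describes, using toy data and hypothetical dimensions; the log transform, distance matrix, and PCA steps mirror the description, not the cited study's actual pipeline:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for protein spot intensities: 200 spots x 12 samples.
rng = np.random.default_rng(1)
intensities = rng.lognormal(mean=2.0, sigma=1.0, size=(200, 12))

X = np.log(intensities)                          # log-transform intensities
D = squareform(pdist(X.T, metric="euclidean"))   # sample-by-sample distances

# PCA of the samples via SVD of the centered data matrix.
Xc = X.T - X.T.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]                        # first two principal components
print(scores.shape)                              # (12, 2): one coordinate per sample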
“…For larger networks with millions of nodes, randomized algorithms may be used. Halko et al [24] have used a randomized version of the block Lanczos method for computing the SVD of large matrices with time complexity O(ikN_a + i²k²n), where i ≤ 2, k is the number of principal components to be computed, and N_a is the number of non-zero entries in the matrix.…”
Section: Complexity Analysis of the Proposed Algorithm
confidence: 99%
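
To make that cost model concrete, here is a hedged sketch using scikit-learn's randomized_svd on a sparse matrix. It belongs to the same algorithmic family but is not the block Lanczos variant the excerpt credits to Halko et al.; the complexity remarks in the comments paraphrase the excerpt.

import scipy.sparse as sp
from sklearn.utils.extmath import randomized_svd

# Sparse test matrix with roughly 200k non-zero entries (N_a).
A = sp.random(20_000, 10_000, density=1e-3, format="csr", random_state=0)

# The cost is dominated by i passes of sparse products over the N_a
# non-zeros (the O(ikN_a) term), plus dense QR/SVD work on the
# (k + oversample)-column blocks (the O(i²k²n) term).
U, s, Vt = randomized_svd(A, n_components=10, n_iter=2, random_state=0)
print(s)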