2011
DOI: 10.1137/100804139

An Algorithm for the Principal Component Analysis of Large Data Sets

Abstract: Recently popularized randomized methods for principal component analysis (PCA) efficiently and reliably produce nearly optimal accuracy, even on parallel processors, unlike the classical (deterministic) alternatives. We adapt one of these randomized methods for use with data sets that are too large to be stored in random-access memory (RAM). (The traditional terminology is that our procedure works efficiently out-of-core.) We illustrate the performance of the algorithm via several numerical examples. For examp…
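
As a rough illustration of the family of methods the abstract refers to, here is a minimal in-memory sketch of randomized PCA in NumPy. This is not the paper's out-of-core algorithm; the function name and parameters are illustrative only:

import numpy as np

def randomized_pca(A, k, oversample=10, n_iter=2, seed=None):
    """Approximate top-k principal components of A (n_samples x n_features).

    A minimal in-RAM sketch of the randomized method, not the paper's
    out-of-core implementation.
    """
    rng = np.random.default_rng(seed)
    A = A - A.mean(axis=0)                 # center columns so SVD gives the PCs
    n, m = A.shape
    l = k + oversample                     # a few extra columns for safety
    Y = A @ rng.standard_normal((m, l))    # sample the range of A
    Q, _ = np.linalg.qr(Y)
    for _ in range(n_iter):                # power iterations sharpen the spectrum
        Q, _ = np.linalg.qr(A.T @ Q)       # re-orthonormalize each half-step
        Q, _ = np.linalg.qr(A @ Q)         # for numerical stability
    B = Q.T @ A                            # small l x m matrix
    Uhat, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Uhat)[:, :k], s[:k], Vt[:k]

# Example: a tall random matrix with planted rank structure.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50)) @ rng.standard_normal((50, 300))
U, s, Vt = randomized_pca(X, k=10)
print(s[:5])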

Cited by 246 publications (228 citation statements)
References 11 publications
“…That is, whereas the previous bounds held for any input, here the proper choice for the oversampling factor p can be quite sensitive to the input matrix. For example, when parameterized this way, p could depend on the size of the matrix dimensions, the decay properties of the spectrum, and the particular choice made for the random projection matrix [160,161,82,162,14,163]. Moreover, for worst-case input matrices, such a procedure may fail.…”
Section: An Improved Random Projection Algorithm
confidence: 99%
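
The sensitivity to the oversampling factor p that this excerpt describes can be seen in a toy experiment. This is a hedged sketch: the matrix, its spectrum, and the parameter choices are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 10
# Symmetric test matrix with a slowly decaying spectrum, the regime
# in which the choice of p matters most.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 1.0 / np.arange(1, n + 1) ** 0.5
A = (U * s) @ U.T

for p in (0, 5, 20):
    Omega = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(A @ Omega)       # basis from the sampled range
    err = np.linalg.norm(A - Q @ (Q.T @ A), 2)
    print(f"p = {p:2d}: ||A - QQ^T A||_2 = {err:.3e}")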
“…This has the effect of speeding up the decay of the spectrum while leaving the singular vectors unchanged, and it is observed in [162,14] that q = 2 or q = 4 is often sufficient for certain data matrices of interest. This algorithm was analyzed in greater detail for the case of Gaussian random matrices in [14], and an out-of-core implementation (meaning, appropriate for data sets that are too large to be stored in RAM) of it was presented in [163].…”
confidence: 99%
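
A quick numerical check of the claim in that excerpt: the singular values of (A Aᵀ)^q A are σᵢ^(2q+1), so the spectrum decays much faster while the singular vectors are unchanged. Illustrative NumPy, not code from the cited works:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 300))
s = np.linalg.svd(A, compute_uv=False)

for q in (0, 2, 4):
    # Singular values of (A A^T)^q A are sigma_i ** (2q + 1), so the
    # relative gap between leading and trailing values widens rapidly.
    sq = s ** (2 * q + 1)
    print(f"q = {q}: sigma_20 / sigma_1 = {sq[19] / sq[0]:.2e}")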
“…The multivariate analysis was based on the Euclidean distance matrix calculated from log-transformed protein spot intensity data, with pair-wise tests conducted between groups. PCA is one of the most commonly used techniques for an in-depth search of predominant proteomic patterns in a large dataset (Halko et al, 2011).…”
Section: Proteome Analysis
confidence: 99%
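
As a schematic of the workflow that excerpt describes, using toy data and hypothetical dimensions; the log transform, distance matrix, and PCA steps mirror the description, not the cited study's actual pipeline:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for protein spot intensities: 200 spots x 12 samples.
rng = np.random.default_rng(1)
intensities = rng.lognormal(mean=2.0, sigma=1.0, size=(200, 12))

X = np.log(intensities)                          # log-transform intensities
D = squareform(pdist(X.T, metric="euclidean"))   # sample-by-sample distances

# PCA of the samples via SVD of the centered data matrix.
Xc = X.T - X.T.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]                        # first two principal components
print(scores.shape)                              # (12, 2): one coordinate per sample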
“…For larger networks with millions of nodes, randomized algorithms may be used. Halko et al [24] have used a randomized version of the block Lanczos method for computing the SVD of large matrices with time complexity O(ikN_a + i²k²n), where i ≤ 2, k is the number of principal components to be computed, and N_a is the number of non-zero entries in the matrix.…”
Section: Complexity Analysis of the Proposed Algorithm
confidence: 99%
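
To make that cost model concrete, here is a hedged sketch using scikit-learn's randomized_svd on a sparse matrix. It belongs to the same algorithmic family but is not the block Lanczos variant the excerpt credits to Halko et al.; the complexity remarks in the comments paraphrase the excerpt.

import scipy.sparse as sp
from sklearn.utils.extmath import randomized_svd

# Sparse test matrix with roughly 200k non-zero entries (N_a).
A = sp.random(20_000, 10_000, density=1e-3, format="csr", random_state=0)

# The cost is dominated by i passes of sparse products over the N_a
# non-zeros (the O(ikN_a) term), plus dense QR/SVD work on the
# (k + oversample)-column blocks (the O(i²k²n) term).
U, s, Vt = randomized_svd(A, n_components=10, n_iter=2, random_state=0)
print(s)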