2008
DOI: 10.1007/978-3-540-92182-0_38
Deterministic Sparse Column Based Matrix Reconstruction via Greedy Approximation of SVD

Abstract: Given a matrix A ∈ R^{m×n} of rank r, and an integer k < r, the top k singular vectors provide the best rank-k approximation to A. When the columns of A have specific meaning, it is desirable to find (provably) "good" approximations to A_k which use only a small number of columns in A. Proposed solutions to this problem have thus far focused on randomized algorithms. Our main result is a simple greedy deterministic algorithm with guarantees on the performance and the number of columns chosen. Specifica…

Cited by 6 publications (16 citation statements). References 25 publications.
“…For this and other reasons, a common task in genetics and other areas of data analysis is the following: given an input data matrix A and a parameter k, find the best subset of exactly k actual DNA SNPs or actual genes, i.e., actual columns or rows from A, to use to cluster individuals, reconstruct biochemical pathways, reconstruct signal, perform classification or inference, etc. Unfortunately, common formalizations of this algorithmic problem, including looking for the k actual columns that capture the largest amount of information or variance in the data or that are maximally uncorrelated, lead to intractable optimization problems [22,23]. For example, consider the so-called Column Subset Selection Problem [24]: given as input an arbitrary m × n matrix A and a rank parameter k, choose the set of exactly k columns of A s.t.…”
Section: Motivating Scientific Applications
confidence: 99%
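The Column Subset Selection objective sketched above — choose exactly k actual columns C of A so that projecting A onto their span leaves a small residual — can be illustrated with a small greedy sketch. The function name `greedy_css` and its pick-the-largest-residual-norm selection rule are illustrative assumptions here, not the exact algorithm of the paper under discussion:

```python
import numpy as np

def greedy_css(A, k):
    """Greedily pick k column indices of A: at each step take the column
    whose residual (after projecting out the already-chosen columns) has
    the largest norm. Illustrative sketch only; the paper's deterministic
    algorithm and its selection criterion may differ."""
    chosen = []
    R = A.astype(float).copy()  # residual of A against the chosen columns
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        if chosen:
            norms[np.array(chosen)] = -1.0  # never re-pick a chosen column
        j = int(np.argmax(norms))
        chosen.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(q, q @ R)          # project residual off column j
    return chosen

# small demo: pick 3 of 6 columns and measure the projection residual
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
cols = greedy_css(A, 3)
C = A[:, cols]
err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A)
```

The residual `err` plays the role of the CSSP objective ‖A − CC⁺A‖_F; the intractability remark above concerns finding the columns that minimize it exactly.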
“…That being said, running the risk of such a failure might be acceptable if one can efficiently couple to a diagnostic to check for such a failure, and if one can then correct for it by choosing more samples if necessary. The best numerical implementations of randomized matrix algorithms for low-rank matrix approximation do just this, and the strongest results in terms of minimizing p take advantage of Condition (22) in a somewhat different way than was originally used in the analysis of the CSSP [14]. For example, rather than choosing O(k log k) dimensions and then filtering them through exactly k dimensions, as the relative-error random sampling and relative-error random projection algorithms do, one can choose some number ℓ of dimensions and project onto a k′-dimensional subspace, where k < k′ ≤ ℓ, while exploiting Condition (22) to bound the error, as appropriate for the computational environment at hand [14].…”
Section: An Improved Random Projection Algorithm
confidence: 99%
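The choose-ℓ-dimensions-then-project-to-k′ strategy described in that passage can be sketched generically with a random-projection low-rank approximation. This is a minimal sketch of the general technique (sample ℓ = k + oversample directions, orthonormalize the sampled range, truncate to the k leading directions of that subspace), under the assumption k′ = k; the function and parameter names are illustrative, not the cited implementation:

```python
import numpy as np

def randomized_low_rank(A, k, oversample=5, seed=None):
    """Rank-k approximation via random projection: sample ell = k + oversample
    random directions, form an orthonormal basis for the sampled range, then
    keep the k leading directions of that subspace. Generic sketch; the
    oversampling amount and truncation rule are assumptions."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    ell = k + oversample
    Omega = rng.standard_normal((n, ell))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)          # orthonormal basis, m x ell
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U[:, :k]) * s[:k] @ Vt[:k]  # rank-k approximation of A

# sanity check: a matrix of exact rank 3 is recovered almost exactly
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))
Ak = randomized_low_rank(A, 3, seed=1)
rel_err = np.linalg.norm(A - Ak) / np.linalg.norm(A)
```

Oversampling (ℓ > k) is exactly the "choosing more samples" safeguard mentioned above: the extra directions make it very unlikely that the random projection misses part of the dominant subspace.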
“…In this implementation, the MATLAB qr function is first used to calculate the QR decomposition with column pivoting, and then the columns are swapped using the criterion specified by Gu and Eisenstat [31]. ApproxSVD is the sparse approximation of the Singular Value Decomposition (SVD) [9,10]. The algorithm was implemented in MATLAB.…”
Section: Evaluation Of Centralized Greedy CSS
confidence: 99%
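The column-pivoted QR step that this implementation starts from can be sketched in NumPy. This is a teaching sketch of plain Businger–Golub pivoting (at each step, move the remaining column with the largest residual norm into the pivot position), analogous to what MATLAB's qr with pivoting computes; the Gu–Eisenstat strong RRQR swap criterion applied afterwards in the cited implementation is more sophisticated and is not reproduced here:

```python
import numpy as np

def pivoted_qr(A):
    """QR with column pivoting (Businger-Golub rule) via Gram-Schmidt.
    Returns Q, R, piv with Q @ R == A[:, piv]. Sketch for illustration;
    not the cited MATLAB/RRQR implementation."""
    work = A.astype(float).copy()
    m, n = work.shape
    r = min(m, n)
    Q = np.zeros((m, r))
    R = np.zeros((r, n))
    piv = np.arange(n)
    for j in range(r):
        # pivot: bring the remaining column with largest residual norm to j
        p = j + int(np.argmax(np.linalg.norm(work[:, j:], axis=0)))
        work[:, [j, p]] = work[:, [p, j]]
        R[:, [j, p]] = R[:, [p, j]]
        piv[[j, p]] = piv[[p, j]]
        rjj = np.linalg.norm(work[:, j])
        if rjj == 0.0:                      # remaining columns are all zero
            break
        R[j, j] = rjj
        Q[:, j] = work[:, j] / rjj
        coeffs = Q[:, j] @ work[:, j + 1:]  # orthogonalize remaining columns
        R[j, j + 1:] = coeffs
        work[:, j + 1:] -= np.outer(Q[:, j], coeffs)
    return Q, R, piv

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 6))
Q, R, piv = pivoted_qr(A)
```

The pivot order `piv` is what makes this decomposition useful for column selection: the leading k pivots identify columns that greedily capture the most residual norm, and |R[j, j]| is non-increasing in j.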