Proceedings of the 2007 SIAM International Conference on Data Mining 2007
DOI: 10.1137/1.9781611972771.33
Less is More: Compact Matrix Decomposition for Large Sparse Graphs

Abstract: Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low-rank decompositions, such as SVD and CUR, are powerful techniques for revealing latent/hidden variables and associated patterns from high-dimensional data. However, those methods often ignore the sparsity property of the graph, and hence usually…

Cited by 85 publications (78 citation statements)
References 29 publications
Citing publications span 2009–2024
“…We compare LS-DCUR against CUR-L2 (Euclidean-norm based selection), CUR-SL (statistical-leverage based selection), and Greedy (a recent deterministic selection method shown to exceed various other methods [7]). To provide a fair comparison, we incorporate several extensions into the importance-sampling based methods: both CUR-L2 and CUR-SL use the extensions proposed for CMD [26], and, in both cases, we sample exactly the same number of unique rows and columns as for LS-DCUR and Greedy (double selections do not count as a selected row or column). For methods requiring computation of the top-k singular vectors (CUR-SL, Greedy), we specify a reasonable k. As setting it to the actual number of sampled rows and columns is not advisable, we follow the suggestion of [22] and over-sample k; various experimental runs show that setting k to ≈ 4/5 of the number of row and column samples provides a convenient tradeoff between run-time performance and approximation accuracy; note that LS-DCUR does not require any additional parameters apart from the number of desired rows and columns.…”
Section: Methods
confidence: 99%
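The norm-based importance sampling behind CUR-L2, combined with the "count only unique rows and columns" convention described in the statement above, can be sketched as follows. This is a minimal illustration under my own naming (the function and its signature are not from the cited papers): columns are drawn with probability proportional to their squared Euclidean norms, and sampling repeats until the target number of distinct columns is collected.

```python
import numpy as np

def sample_unique_columns(A, c, rng=None):
    """Draw columns of A with probability proportional to their
    squared Euclidean norms, repeating until c *unique* column
    indices have been collected (duplicate draws are not counted)."""
    rng = np.random.default_rng(rng)
    norms = np.sum(A * A, axis=0)          # squared column norms
    probs = norms / norms.sum()            # sampling distribution
    chosen = set()
    while len(chosen) < c:
        chosen.add(int(rng.choice(len(probs), p=probs)))
    idx = sorted(chosen)
    return idx, A[:, idx]

# Tiny example: pick 2 unique columns from a 3x4 matrix.
A = np.arange(12, dtype=float).reshape(3, 4)
idx, C = sample_unique_columns(A, 2, rng=0)
```

The same routine applied to rows of A.T gives the row selection, which is how the experiments above keep the number of unique selections identical across methods.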
“…Various extensions to this method have been proposed. For example, the approach in [26] further reduces computation time by avoiding repeated sampling of the same row (or column).…”
Section: CUR Decomposition
confidence: 99%