Abstract. Simplex Volume Maximization (SiVM) exploits distance geometry to factorize gigantic matrices efficiently. It has proven successful in game, social media, and plant mining. Here, we review the distance geometry approach and argue that it generally suggests factorizing gigantic matrices with search-based rather than optimization-based techniques.
Interpretable Matrix Factorization

Many modern data sets are available in the form of a real-valued $m \times n$ matrix $V$ of rank $r \leq \min(m, n)$. The columns $v_1, \ldots, v_n$ of such a data matrix encode information about $n$ objects, each of which is characterized by $m$ features. Typical examples of objects include text documents, digital images, genomes, stocks, or social groups. Examples of corresponding features are measurements such as term frequency counts, intensity gradient magnitudes, or incidence relations among the nodes of a graph. In most modern settings, the dimensions of the data matrix are large, so it is useful to determine a compressed representation that may be easier to analyze and interpret in light of domain-specific knowledge.

Formally, compressing a data matrix $V \in \mathbb{R}^{m \times n}$ can be cast as a matrix factorization (MF) task. The idea is to determine factor matrices $W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{k \times n}$ whose product is a low-rank approximation of $V$. This amounts to the minimization problem

$$\min_{W, H} \lVert V - WH \rVert^2$$

where $\lVert \cdot \rVert$ denotes a suitable matrix norm, and one typically assumes $k \ll r$.

A common way of obtaining a low-rank approximation is to truncate the singular value decomposition (SVD), where $V = W S U^T = WH$ with $H = S U^T$. The SVD is popular because it can be solved analytically and has significant statistical properties. The column vectors $w_i$ of $W$ are orthogonal basis vectors that coincide with the directions of largest variance in the data. Although there are many successful applications of the SVD, for instance in information retrieval, it has been criticized because the $w_i$ may lack interpretability with respect to the field from which the data are drawn [6]. For example, the $w_i$ may point in the direction of negative orthants even though the data itself is strictly non-negative. Nevertheless, data analysts are often tempted to reify, i.e., to assign a "physical"

The authors would like to thank the anonymous reviewers for their comments.
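The truncated SVD described above, and the interpretability concern raised about it, can be illustrated with a minimal NumPy sketch. The function name `truncated_svd_factorization`, the random toy matrix, and the choice k = 3 are assumptions made for illustration only; they do not come from the original text. The sketch builds W and H from the k leading singular triplets, measures the approximation error in the Frobenius norm (one possible choice of "suitable matrix norm"), and checks whether W contains negative entries even though V is non-negative.

```python
import numpy as np


def truncated_svd_factorization(V, k):
    """Return W (m x k), H (k x n) and all singular values of V.

    Follows the text's notation V = W S U^T, with H = S U^T,
    truncated to the k largest singular values.
    """
    # Thin SVD: columns of W_full are left singular vectors,
    # rows of Ut are right singular vectors, s is sorted descending.
    W_full, s, Ut = np.linalg.svd(V, full_matrices=False)
    W = W_full[:, :k]
    H = s[:k, None] * Ut[:k, :]  # scale each retained row of U^T
    return W, H, s


# Toy data (assumption): a 20 x 50 matrix with non-negative entries.
rng = np.random.default_rng(0)
V = rng.random((20, 50))
k = 3

W, H, s = truncated_svd_factorization(V, k)

# Frobenius-norm reconstruction error of the rank-k approximation.
err = np.linalg.norm(V - W @ H, "fro")

# For the truncated SVD this error equals the root of the sum of the
# squared discarded singular values.
tail = np.sqrt(np.sum(s[k:] ** 2))

# The basis vectors are orthonormal, but they are not non-negative.
has_negative = bool((W < 0).any())

print(f"reconstruction error: {err:.4f}, discarded tail: {tail:.4f}")
print(f"W contains negative entries: {has_negative}")
```

Because the columns of W are mutually orthogonal, at most one of them can have entries of a single sign when the data are strictly positive; for k ≥ 2 some basis vector has to mix signs, which is the source of the interpretability problem noted above.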