A space and time efficient algorithm for SimRank computation

Yu, Weiren; Zhang, Wenjie; Lin, Xuemin; Zhang, Qing; Le, Jiajin

doi:10.1007/s11280-010-0100-6

Cited by 57 publications

(33 citation statements)

References 22 publications

“…Since SimRank was first introduced by Jeh and Widom [14] in 2002, there have been a lot of research works [21,19,26,18,30,28,10,12,6,27] to optimize the computation of SimRank. The proposed solutions can be classified into three categories.…”

Section: Related Work (mentioning)

confidence: 99%

“…Later, Lizorkin et al. [21] improved the original solution via partial sum memoization to O(kdN^2) time. Yu et al. [26] used fast matrix multiplication to speed up the all-pairs SimRank computation as well. Recently, Yu et al. [28] further enhanced the SimRank computation to O(kd'N^2) time (with d' < d) through fine-grained memoization.…”

Section: Related Work (mentioning)

confidence: 99%
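
For orientation, the baseline that the optimizations quoted above improve on can be sketched as follows. This is a minimal Python sketch of the naive iterative SimRank recurrence, not the algorithm of any paper listed here. The function name `simrank`, the `in_nbrs` adjacency format, and the default decay factor c = 0.8 are illustrative assumptions.

```python
def simrank(in_nbrs, c=0.8, iterations=10):
    """Naive all-pairs SimRank (illustrative sketch).

    in_nbrs: dict mapping every node to a list of its in-neighbors
             (hypothetical input format chosen for this example).
    c:       decay factor in (0, 1).
    Returns a nested dict sim[u][v] of similarity scores.
    """
    nodes = list(in_nbrs)
    # Iteration 0: a node is fully similar to itself and to nothing else.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                elif not in_nbrs[u] or not in_nbrs[v]:
                    # A node without in-neighbors is similar only to itself.
                    new[u][v] = 0.0
                else:
                    # Average the previous scores over all in-neighbor pairs.
                    total = sum(sim[p][q]
                                for p in in_nbrs[u]
                                for q in in_nbrs[v])
                    new[u][v] = c * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
        sim = new
    return sim


if __name__ == "__main__":
    # Tiny example: nodes a and b share the single in-neighbor p.
    graph = {"p": [], "a": ["p"], "b": ["p"]}
    print(simrank(graph)["a"]["b"])
```

Each iteration visits every node pair and every pair of their in-neighbors, which is why the unoptimized cost grows with the square of the average in-degree as well as with N^2; memoizing partial sums removes one factor of the in-degree.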

An efficient similarity search framework for SimRank over large dynamic graphs

et al. 2015

SimRank is an important measure of vertex-pair similarity according to the structure of graphs. The similarity search based on SimRank is an important operation for identifying similar vertices in a graph and has been employed in many data analysis applications. Nowadays, graphs in the real world become much larger and more dynamic. The existing solutions for similarity search are expensive in terms of time and space cost. None of them can efficiently support similarity search over large dynamic graphs. In this paper, we propose a novel two-stage random-walk sampling framework (TSF) for SimRank-based similarity search (e.g., top-k search). In the preprocessing stage, TSF samples a set of one-way graphs to index raw random walks in a novel manner within O(N R_g) time and space, where N is the number of vertices and R_g is the number of one-way graphs. The one-way graph can be efficiently updated in accordance with the graph modification, thus TSF is well suited to dynamic graphs. During the query stage, TSF can search similar vertices fast by naturally pruning unqualified vertices based on the connectivity of one-way graphs. Furthermore, with additional R_q samples, TSF can estimate the SimRank score with probability 1 − 2e^{−2ε² R_g R_q / (1−c)²} if the error of approximation is bounded by 1 − ε. Finally, to guarantee the scalability of TSF, the one-way graphs can also be compactly stored on the disk when the memory is limited. Extensive experiments have demonstrated that TSF can handle dynamic billion-edge graphs with high performance.

“…One fundamental task underpinning many graph mining problems such as recommender systems [14] and information retrieval [8] is the computation of similarity between objects. Among various ways of evaluating object similarity on graph [26,27,11], SimRank [15] is probably one of the most popular [10,22,21,19,29,16,23,28,25].…”

Section: Introduction (mentioning)

confidence: 99%

“…It may take up to O(n^3) time and O(n^2) space for computing the similarity of two nodes even on a sparse graph, which is clearly far from acceptable for large problems. Indeed, since it emerged [15], its scalability has been a critical issue of study [10,22,21,19,29,16,23,28]. Despite significant progress, there is still a large gap to a practically scalable solution.…”

Section: Introduction (mentioning)

confidence: 99%

Walking in the cloud

Fang, Liu, et al. 2015

Proc. VLDB Endow.

Despite its popularity, SimRank is computationally costly, in both time and space. In particular, its recursive nature poses a great challenge in using modern distributed computing power, and also prevents querying similarities individually. Existing solutions suffer greatly from these practical issues. In this paper, we break such dependency for the maximum efficiency possible. Our method consists of offline and online phases. In the offline phase, a length-n indexing vector is derived by solving a linear system in parallel. At online query time, the similarities are computed instantly from the index vector. Throughout, the Monte Carlo method is used to maximally reduce time and space. Our algorithm, called CloudWalker, is highly parallelizable, with only linear time and space. Remarkably, it responds to both single-pair and single-source queries in constant time. CloudWalker is orders of magnitude more efficient and scalable than existing solutions for large-scale problems. Implemented on Spark with 10 machines and tested on the web-scale ClueWeb graph with 1 billion nodes and 43 billion edges, it takes 110 hours for offline indexing, 64 seconds for a single-pair query, and 188 seconds for a single-source query. To the best of our knowledge, our work is the first to report results on ClueWeb, which is 10x larger than the largest graph ever reported for SimRank computation.
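
To make the Monte Carlo idea mentioned in this abstract concrete, here is a minimal Python sketch of the standard random-walk view of SimRank, in which the score of a pair is estimated as the average of c^t over sampled pairs of backward walks, with t the first step at which the two walks meet. It is not CloudWalker's indexing scheme, which the abstract describes only at a high level. The function name `mc_simrank`, the `in_nbrs` format, and all default parameters are assumptions made for illustration.

```python
import random


def mc_simrank(in_nbrs, a, b, c=0.6, walks=10000, max_steps=20, seed=0):
    """Monte Carlo estimate of the SimRank score of nodes a and b (sketch).

    in_nbrs:   dict mapping every node to a list of its in-neighbors
               (hypothetical input format chosen for this example).
    walks:     number of sampled walk pairs.
    max_steps: walks are truncated here, so the estimate is slightly low.
    """
    if a == b:
        return 1.0
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    total = 0.0
    for _ in range(walks):
        u, v = a, b
        for step in range(1, max_steps + 1):
            if not in_nbrs[u] or not in_nbrs[v]:
                break  # a walk is stuck, so this pair never meets
            # Move both walkers one step backward along in-edges.
            u = rng.choice(in_nbrs[u])
            v = rng.choice(in_nbrs[v])
            if u == v:
                total += c ** step  # first meeting after `step` steps
                break
    return total / walks


if __name__ == "__main__":
    # a and b share their only in-neighbor p, so every walk pair meets at step 1.
    graph = {"p": [], "a": ["p"], "b": ["p"]}
    print(mc_simrank(graph, "a", "b"))
```

Sampling trades exactness for cost: each query touches only the sampled walks rather than all node pairs, which is the general reason Monte Carlo methods scale to graphs where all-pairs iteration does not.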

“…The intuition is that a page has high rank if the sum of the ranks of its incoming links is high. A time and space efficient algorithm to rank web documents based on a graph model on hyperlinks is proposed in [45]. Reference [8] presented a method to extend the link analysis from page level to block-level.…”

mentioning

confidence: 99%

A path-based approach for web page retrieval

García-Molina

2011

World Wide Web

Use of links to enhance page ranking has been widely studied. The underlying assumption is that links convey recommendations. Although this technique has been used successfully in global web search, it produces poor results for website search, because the majority of the links in a website are used to organize information and convey no recommendations. By distinguishing these two kinds of links, respectively for recommendation and information organization, this paper describes a path-based method for web page ranking. We define the Hierarchical Navigation Path (HNP) as a new resource for improving web search. HNP is composed of multi-step navigation information in visitors' website browsing. It provides indications of the content of the destination page. We first classify the links inside a website. Then, the links for web page organization are exploited to construct the HNPs for each page. Finally, the PathRank algorithm is described for web page retrieval. The experiments show that our approach results in significant improvements over existing solutions.
