Refinement of TF-IDF schemes for web pages using their hyperlinked neighboring pages

Sugiyama, Kazunari; Hatano, Kenji; Yoshikawa, Masayuki; Uemura, Shunsuke

doi:10.1145/900095.900096

Cited by 22 publications

(27 citation statements)

References 0 publications

“…The cosine TFIDF weighting scheme is widely used in IR to determine the similarity between two documents [2,29,30,31]. However, its precision is not very high [34,35]. In this paper, we use it as a rough metric of similarity for the web pages in CW dataset.…”

Section: About the Cosine Tfidf Metric (mentioning)

confidence: 99%

MatchSim: a novel similarity measure based on maximum neighborhood matching

2011

Measuring object similarity in a graph is a fundamental data-mining problem in various application domains, including Web linkage mining, social network analysis, information retrieval, and recommender systems. In this paper, we focus on the neighbor-based approach that is based on the intuition that "similar objects have similar neighbors" and propose a novel similarity measure called MatchSim. Our method recursively defines the similarity between two objects by the average similarity of the maximum-matched similar neighbor pairs between them. We show that MatchSim conforms to the basic intuition of similarity; therefore, it can overcome the counterintuitive contradiction in SimRank. Moreover, MatchSim can be viewed as an extension of the traditional neighbor-counting scheme by taking the similarities between neighbors into account, leading to higher flexibility. We present the MatchSim score computation process and prove its convergence. We also analyze its time and space complexity and suggest two accelerating techniques: (1) proposing a simple pruning strategy and (2) adopting an approximation algorithm for maximum matching computation. Experimental results on real-world datasets show that although our method is less efficient computationally, it outperforms classic methods in terms of accuracy.
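The citation statement above uses cosine TF-IDF as a rough similarity metric between web pages. A minimal sketch of that idea follows, assuming raw term frequency multiplied by log(N / df) as the weighting; the function names and the toy pages are hypothetical and are not taken from either paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical): three tiny "pages" as token lists.
pages = [
    "web page similarity link".split(),
    "web page ranking link graph".split(),
    "file system search engine".split(),
]
vecs = tfidf_vectors(pages)
```

Under these assumptions, pages that share weighted terms score above zero, and pages with no terms in common score exactly zero, which illustrates why the metric is only a rough one.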

“…We also plan to investigate additional IDF evaluation techniques, such as estimation based on limited crawls of hyperlinked neighboring pages [26]. Scraping Google for IDF values is not a viable long-term strategy, and at the very least we have not considered multi-lingual support in our prototype.…”

Section: Idf (mentioning)

confidence: 99%

Just-in-time recovery of missing web pages

Harrison; Nelson

2006

Proceedings of the Seventeenth Conference on Hypertext and Hypermedia

We present Opal, a lightweight framework for interactively locating missing web pages (HTTP status code 404). Opal is an example of "in vivo" preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation. Opal servers learn from their experiences and are able to share their knowledge with other Opal servers by mutual harvesting using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Using cached copies that can be found on the web, Opal creates lexical signatures which are then used to search for similar versions of the web page. We present the architecture of the Opal framework, discuss a reference implementation of the framework, and present a quantitative analysis of the framework that indicates that Opal could be effectively deployed.

“…As we have discussed earlier, FileRank is used by Eureka to scale the IR rankings of search results and bias them toward the more "important" files. Our approach is inspired by Hypertext [6,7] and Web-based [2,5] techniques, where the importance of a document is determined by the number and type of links that reach it. More formally, our technique performs a random walk over the semantic file graph where the probability of traversing a link is proportional to its weight.…”

Section: Filerank Computation (mentioning)

confidence: 99%

“…This paper describes Eureka, a file system search engine that employs a "structured" view of the world in order to improve the effectiveness of file searches. Eureka is inspired by research in the Web [2,5] and Hypertext [6,7] communities, which has shown that the overall structure in a collection of hyper-linked documents can play an important role in determining the importance and ranking of different documents. Based on this intuition, we develop a framework for inferring semantic links in a file system, thus transforming a "flat" collection of files into a graph of hyper-linked documents, and quantifying the importance of each file based on the characteristics of this semantic graph.…”

Section: Introduction (mentioning)

confidence: 99%

Searching a file system using inferred semantic links

Bhagwat; Polyzotis

2005

Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia

We describe Eureka, a file system search engine that takes into account the inherent relationships among files in order to improve the rankings of search results. The key idea behind our approach is a simple, yet powerful framework that automatically infers semantic links among files and thus transforms the file system into a network of hyper-linked documents. Based on this model, we propose the FileRank metric that examines the structure of the semantic graph and essentially quantifies the "importance" of each file in the file system. By combining FileRank with conventional IR metrics, Eureka can bias the rankings of the search results toward the more important files and thus provide more effective support in the task of locating useful files. We outline the design of the Eureka search engine and discuss the inference of semantic links and the computation of the FileRank metric.
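The FileRank statement quoted above describes a random walk over a weighted graph in which the probability of following a link is proportional to its weight. The sketch below illustrates that general idea only; it is not the Eureka implementation. The damping factor, the uniform teleport step, the handling of nodes without outgoing links, and the example file names are all assumptions borrowed from PageRank-style computations.

```python
def weighted_random_walk(graph, damping=0.85, iterations=50):
    """graph: {node: {neighbor: weight}}; returns {node: score}."""
    nodes = set(graph)
    for edges in graph.values():
        nodes.update(edges)
    nodes = sorted(nodes)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Assumed teleport step: a uniform share for every node.
        nxt = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            edges = graph.get(node, {})
            total = sum(edges.values())
            if total == 0:
                # Assumed handling of dangling nodes: spread score uniformly.
                for other in nodes:
                    nxt[other] += damping * score[node] / n
                continue
            for neighbor, weight in edges.items():
                # Transition probability proportional to the link weight.
                nxt[neighbor] += damping * score[node] * weight / total
        score = nxt
    return score


# Hypothetical semantic file graph with weighted links.
links = {
    "report.tex": {"data.csv": 3.0, "notes.txt": 1.0},
    "notes.txt": {"data.csv": 1.0},
    "data.csv": {"report.tex": 1.0},
}
ranks = weighted_random_walk(links)
```

In this toy graph, `data.csv` receives the heaviest incoming links and therefore ends up with a higher score than `notes.txt`; the scores sum to 1 because each iteration redistributes the full probability mass.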
