Adaptive on-line page importance computation

Abiteboul, Serge; Preda, Mihai; Cobena, Grégory

doi:10.1145/775189.775192

Cited by 52 publications

(82 citation statements)

References 0 publications

“…PageRank and its variations are currently being used by major search engines. [1,15,16] describe various ways to improve PageRank computation. [2] provides a theoretical justification for the Hub and Authority metric and proposes a mechanism to combine link and text analysis for page ranking.…”

Section: Related Work (mentioning)

confidence: 99%

“…Quality change during measurement: In our theoretical derivations, we assumed that the quality remains constant during measurement. This assumption is reasonable when we can measure the derivative instantaneously, but when it is measured over a time period, it is possible that the quality may change during the time.…”

Section: Measuring Quality From Web Snapshots (mentioning)

confidence: 99%

Page quality

Cho

Roy

Adams

2005

Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data

In a number of recent studies [4,8] researchers have found that because search engines repeatedly return currently popular pages at the top of search results, popular pages tend to get even more popular, while unpopular pages get ignored by an average user. This "rich-get-richer" phenomenon is particularly problematic for new and high-quality pages because they may never get a chance to get users' attention, decreasing the overall quality of search results in the long run. In this paper, we propose a new ranking function, called page quality, that can alleviate the problem of popularity-based ranking. We first present a formal framework to study the search engine bias by discussing what is an "ideal" way to measure the intrinsic quality of a page. We then compare how PageRank, the current ranking metric used by major search engines, differs from this ideal quality metric. This framework will help us investigate the search engine bias in more concrete terms and provide a clear understanding of why PageRank is effective in many cases and exactly when it is problematic. We then propose a practical way to estimate the intrinsic page quality to avoid the inherent bias of PageRank. We derive our proposed quality estimator through a careful analysis of a reasonable web user model, and we present experimental results that show the potential of our proposed estimator. We believe that our quality estimator has the potential to alleviate the rich-get-richer phenomenon and help new and high-quality pages get the attention that they deserve.

“…Abiteboul et al [3] designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of "cash" which is distributed equally among the pages it points to.…”

Section: Web Crawling Ordering (mentioning)

confidence: 99%

“…OPIC This strategy is based on OPIC [3], which can be seen as a weighted backlink-count strategy. All pages start with the same amount of "cash".…”

Section: Strategies With No Extra Information (mentioning)

confidence: 99%
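
The two statements above describe OPIC only in outline: every page starts with the same amount of "cash", and a page's cash is split equally among the pages it links to. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the function name `opic_order`, the adjacency-dict graph format, the greedy "fetch the page holding the most cash next" rule, the alphabetical tie-break, and the `history` bookkeeping are all choices made here for the example, and cash held by a page with no outgoing links is simply dropped.

```python
def opic_order(graph, initial_cash=1.0):
    """Simulate one greedy OPIC-style pass over a small, fully known graph.

    graph: dict mapping each page to the list of pages it links to.
           Assumption: every link target is itself a key of the dict.
    Returns (order, history): the fetch order and the cash each page
    held at the moment it was fetched.
    """
    # All pages start with the same amount of cash.
    cash = {page: initial_cash for page in graph}
    history = {page: 0.0 for page in graph}
    pending = set(graph)
    order = []

    while pending:
        # Assumed policy: fetch the pending page with the most cash.
        # Sorting first makes ties resolve alphabetically.
        page = max(sorted(pending), key=lambda p: cash[p])
        pending.remove(page)
        order.append(page)

        # Record the cash seen at fetch time, then empty the page.
        amount = cash[page]
        history[page] += amount
        cash[page] = 0.0

        # Distribute the cash equally among the pages it points to.
        links = graph[page]
        if links:
            share = amount / len(links)
            for target in links:
                cash[target] += share
        # Simplification: a page without out-links loses its cash here.

    return order, history
```

For the three-page graph `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, the sketch fetches `a`, then `b`, then `c`: page `c` has collected cash from both of the others by the time it is fetched, which is the sense in which the second statement calls OPIC a weighted backlink count.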

Crawling a country

Baeza-Yates

Castillo

Rodríguez

2005

Special Interest Tracks and Posters of the 14th International Conference on World Wide Web - WWW '05

This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early" during the crawl. As the coverage of modern search engines is small compared to the size of the Web, and it is impossible to index all of the Web for both theoretical and practical reasons, it is relevant to index at least the most important pages. We use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs. As the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. We propose several page ordering strategies that are more efficient than breadth-first search and strategies based on partial PageRank calculations.

“…This decision is guided by the minimization of a cost function which prioritizes XML pages. Some parameters of this cost function are, for instance, the importance of the page [2], the estimated page frequency and the crawler bandwidth.…”

Section: The Sample of the XML Web (mentioning)

confidence: 99%

The XML web

Mignet

Barbosa

Veltri

2003

Proceedings of the Twelfth International Conference on World Wide Web - WWW '03

Although originally designed for large-scale electronic publishing, XML plays an increasingly important role in the exchange of data on the Web. In fact, it is expected that XML will become the lingua franca of the Web, eventually replacing HTML. Not surprisingly, there has been a great deal of interest in XML both in industry and in academia. Nevertheless, to date no comprehensive study of the XML Web (i.e., the subset of the Web made of XML documents only) or of its contents has been made. This paper is the first attempt at describing the XML Web and the documents contained in it. Our results are drawn from a sample of a repository of the publicly available XML documents on the Web, consisting of about 200,000 documents. Our results show that, despite its short history, XML already permeates the Web, both in terms of generic domains and geographically. Also, our results about the contents of the XML Web provide valuable input for the design of algorithms, tools and systems that use XML in one form or another.
