Finding pages on the unarchived Web

Huurdeman, Hugo; Ben-David, Anat; Kamps, Jaap; Samar, Thaer; Vries, Arjen P. de

doi:10.1109/jcdl.2014.6970188

Cited by 7 publications

(9 citation statements)

References 21 publications


“…It is an open question whether the resulting derived representations (based on scant evidence of the pages) is a rich enough characterization to be of practical use. The current paper builds on earlier work on uncovering unarchived pages [30] and the recovery and evaluation of unarchived page descriptions [14]. Parts of the page-level results were presented in [14], now extended with the analysis of site-level representations to overcome anchor text sparsity at the page level [2].…”

Section: Link Evidence and Anchor Text (mentioning)

confidence: 97%

Lost but not forgotten: finding pages on the unarchived web

et al. 2015

Self Cite


Web archives attempt to preserve the fast-changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of the retrieval effectiveness for websites.
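The idea of describing unarchived pages by the anchor text of links that point to them, and then pooling that evidence per host, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_representations`, the `(source_url, target_url, anchor_text)` tuple layout, the whitespace tokenization, and the example URLs are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_representations(links, archived_urls):
    """Collect anchor-text terms for link targets that were never crawled.

    links: iterable of (source_url, target_url, anchor_text) tuples
           extracted from the crawled pages (hypothetical layout).
    archived_urls: set of URLs that are present in the archive.
    Returns (page_terms, host_terms) as plain dicts of term lists.
    """
    page_terms = defaultdict(list)
    host_terms = defaultdict(list)
    for _source_url, target_url, anchor_text in links:
        if target_url in archived_urls:
            continue  # the target was crawled, so it is not unarchived
        terms = anchor_text.lower().split()  # naive tokenization (assumption)
        page_terms[target_url].extend(terms)
        # Pool the same evidence under the target's host name.
        host_terms[urlsplit(target_url).netloc].extend(terms)
    return dict(page_terms), dict(host_terms)


# Toy data: two crawled hosts link to pages on an uncrawled host.
links = [
    ("http://a.example/", "http://lost.example/", "Lost Home"),
    ("http://a.example/x", "http://lost.example/about", "about us"),
    ("http://b.example/", "http://lost.example/", "lost site"),
    ("http://b.example/", "http://a.example/", "site A"),
]
archived = {"http://a.example/", "http://a.example/x", "http://b.example/"}
pages, hosts = build_representations(links, archived)
```

In this toy data the host-level entry for `lost.example` pools the terms of both of its unarchived pages, which mirrors the abstract's observation that host-level aggregation yields richer representations than individual pages.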


“…A recent investigation into the unarchived web [9] has shown that the archived web can be a rich source of links to potentially unarchived content. In this work, we crawl archived pages to increase the size and variety of our dataset.…”

Section: Related Work (mentioning)

confidence: 99%

How Well Are Arabic Websites Archived?

Alkwai, Nelson, Weigle 2015

Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries


It has long been anecdotally known that web archives and search engines favor Western and English-language sites. In this paper we quantitatively explore how well Arabic-language web sites are indexed and archived. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multi-lingual), Raddadi and Star28 (both primarily Arabic language). Using language identification tools we eliminated pages not in the Arabic language (e.g., English language versions of Al-Jazeera sites) and culled the collection to 7,976 definitely Arabic language web pages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic language pages. We discovered: 1) 46% are not archived and 31% are not indexed by Google (www.google.com), 2) only 14.84% of the URIs had an Arabic country code top-level domain (e.g., .sa) and only 10.53% had a GeoIP in an Arabic country, 3) having either only an Arabic GeoIP or only an Arabic top-level domain appears to negatively impact archiving, 4) most of the archived pages are near the top level of the site and deeper links into the site are not well archived, 5) the presence in a directory positively impacts indexing and presence in the DMOZ directory, specifically, positively impacts archiving.


“…Another frequently mentioned issue is related to completeness of the archive: Web archives contain only a small subset of the entire Web. Therefore we explored and evaluated ways to uncover and reconstruct unarchived pages, of which (anchor) references exist in the Dutch Web archive [14,20]. The analysis shows that a remarkable number of representations for unarchived pages could be generated, which we also intend to use to enhance search systems for Web archives.…”

Section: Progress and Future Work (mentioning)

confidence: 99%

Adaptive search systems for web archive research

Huurdeman 2014

Proceedings of the 5th Information Interaction in Context Symposium

Self Cite


The wealth of digital information available in our time has become indispensable for a rich variety of tasks. We use data on the Web for work, leisure, and research, aided by various search systems, allowing us to find small needles in giant haystacks. Despite recent advances in personalization and contextualization, however, various types of tasks, ranging from simple lookup tasks to complex, exploratory and analytical ventures, are mainly supported in elementary, "one-size-fits-all" search interfaces. Web archives, keepers of our future cultural heritage, have gathered petabytes of valuable Web data, which characterize our times for future generations. Access to these archives, however, is surprisingly limited: online Web archives usually provide a URL-based Wayback Machine interface, sometimes extended with rudimentary search options. As a result of limited access, Web archives have not been widely used for research so far. For emerging research using Web archives, there is a need to move beyond URL-based and simple search access, towards providing support for complex (re)search tasks. In my thesis, I am exploring ways to move beyond the "one-size-fits-all" approach for search systems, and I work on systems which can support the flow of complex search, also in the context of archived Web data. Rich models of search and research can be incorporated into adaptive search systems, supporting search strategies in various stages of complex search tasks. Concretely, I look at the use case of the Humanities researcher, for which the large, terabyte-scale Web archives can be a valuable addition to existing sources utilized to perform research.
