IEEE/ACM Joint Conference on Digital Libraries 2014
DOI: 10.1109/jcdl.2014.6970188

Finding pages on the unarchived Web

Abstract: Web archives preserve the fast-changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies: most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web, by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the Dutch Web archive. Our main findings are threefold. Fir…

Citations: cited by 7 publications (9 citation statements)
References: 21 publications

“…It is an open question whether the resulting derived representations, based on scant evidence of the pages, is a rich enough characterization to be of practical use. The current paper builds on earlier work on uncovering unarchived pages [30] and the recovery and evaluation of unarchived page descriptions [14]. Parts of the page-level results were presented in [14], now extended with the analysis of site-level representations to overcome anchor text sparsity at the page level [2].…”
Section: Link Evidence and Anchor Text (mentioning)
confidence: 97%
“…A recent investigation into the unarchived web [9] has shown that the archived web can be a rich source of links to potentially unarchived content. In this work, we crawl archived pages to increase the size and variety of our dataset.…”
Section: Related Work (mentioning)
confidence: 99%
“…Another frequently mentioned issue is related to completeness of the archive: Web archives contain only a small subset of the entire Web. Therefore we explored and evaluated ways to uncover and reconstruct unarchived pages, of which (anchor) references exist in the Dutch Web archive [14,20]. The analysis shows that a remarkable number of representations for unarchived pages could be generated, which we also intend to use to enhance search systems for Web archives.…”
Section: Progress and Future Work (mentioning)
confidence: 99%