Proceedings of the 10th ACM Conference on Web Science 2018
DOI: 10.1145/3201064.3201085

Focused Crawl of Web Archives to Build Event Collections

Abstract: Event collections are frequently built by crawling the live web on the basis of seed URIs nominated by human experts. Focused web crawling is a technique where the crawler is guided by reference content pertaining to the event. Given the dynamic nature of the web and the pace with which topics evolve, the timing of the crawl is a concern for both approaches. We investigate the feasibility of performing focused crawls on the archived web. By utilizing the Memento infrastructure, we obtain resources from 22 web …

Cited by 15 publications (9 citation statements)
References 16 publications (42 reference statements)
“…The authors of [14] evaluated the feasibility of utilizing topical crawlers for building event collections on the existing web archives to reduce the time of the crawling process. They reported good results mostly for events that happened in the past.…”
Section: Topical Crawling Methods (mentioning)
confidence: 99%
“…The problem of knowing what to collect from the web has also been treated in the digital library research community as a focused crawling problem. In focused crawling the goal is to collect content about particular topics (Risse et al, 2012), events (Klein, Balakireva, & Van de Sompel, 2018; Yang, Chitturi, Wilson, Magdy, & Fox, 2012), or to collect content that has a particular characteristic such as popularity (Page, Brin, Motwani, & Winograd, 1999), importance (Baeza-Yates, Marin, Castillo, & Rodriguez, 2005), or social engagement (Gossen, Demidova, & Risse, 2015; Milligan, Ruest, & Lin, 2016; Nwala, Weigle, & Nelson, 2018). Generally speaking these approaches take the focus to be a topic, event, person, organization that can be qualified by the types of media (documents, audio, video).…”
Section: Digital Libraries (mentioning)
confidence: 99%
“…Most focused crawling is performed on the live Web. Unfortunately, the live web is plagued by link rot and content drift, consequently, Klein et al [19] demonstrated that focused crawling on the archived Web results in more relevant collections than focused crawling on the live Web, for events that occurred in the distant past. Additionally, Klein et al proposed extracting seeds from external references contained in the Wikipedia page of an event.…”
Section: Related Work (mentioning)
confidence: 99%
“…In some other cases, archived collections are initiated months or years after the precipitating event. This could have serious consequences since Web archive collections that start late could omit webpages that address the early stages of events [19,24]. Consequently, it is important to start collecting seeds for Web archive collections early.…”
Section: Introduction (mentioning)
confidence: 99%