Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries 2016
DOI: 10.1145/2910896.2910899

Routing Memento Requests Using Binary Classifiers

Abstract: The Memento protocol provides a uniform approach to query individual web archives. Soon after its emergence, Memento Aggregator infrastructure was introduced that supports querying across multiple archives simultaneously. An Aggregator generates a response by issuing the respective Memento request against each of the distributed archives it covers. As the number of archives grows, it becomes increasingly challenging to deliver aggregate responses while keeping response times and computational costs under contr…
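
The fan-out behaviour the abstract describes can be illustrated with a minimal sketch. It assumes a caller-supplied `lookup(archive, uri_r)` function that returns the list of Mementos one archive holds for a URI; `aggregate`, `lookup`, and the archive names are hypothetical and are not part of the Memento protocol or of any Aggregator implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def aggregate(
    uri_r: str,
    archives: List[str],
    lookup: Callable[[str, str], List[str]],
) -> Dict[str, List[str]]:
    """Issue the same lookup against every archive and collect the results.

    `lookup` is a hypothetical stand-in for the per-archive Memento request.
    """
    if not archives:
        return {}
    # One worker per archive, so the requests run concurrently.
    with ThreadPoolExecutor(max_workers=len(archives)) as pool:
        futures = {name: pool.submit(lookup, name, uri_r) for name in archives}
        return {name: future.result() for name, future in futures.items()}
```

Because every archive is queried for every request, the cost of this naive approach grows with the number of archives, which is the problem the abstract raises.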

Cited by 24 publications (24 citation statements)
References 16 publications
“…Of course, the question is: how do we know where to send the URI lookups? We have explored a variety of methods, including building web archive profiles using their CDX files (produced for playback in the Wayback Machine archival playback software) and HTTP access logs of web archives (Alam et al, 2015; Alam et al, 2016b), machine learning techniques based on training data from responses of web archives for a fixed set of URI lookups (Bornand et al, 2016) (this method is in production in the Time Travel service shown in Figures 14.14 and 14.15), and, for web archives that have a full-text index, extracting terms from archived documents and using those as queries back into the archive (Alam et al, 2016a).…”
Section: Routing Uri Requests To the Proper Web Archives (mentioning)
confidence: 99%
“…Unlike previous work [9], in order to generate the richest possible event collections, we are interested in obtaining Mementos from as many publicly available web archives around the world as possible. The Memento infrastructure, and in particular the Memento Aggregator [3], makes this possible. For each URI that needs to be crawled (seed URIs and URIs in the priority queue) until the crawler's stop condition is met, the crawler obtains a Memento of that URI that was archived temporally closest but after DTE.…”
Section: Web Archive Crawls (mentioning)
confidence: 99%
“…Inspired by previous work by Gossen et al [9], in this paper, we present a framework to build event-specific collections by focused crawling of web archives. We utilize the Memento protocol [20] and the associated cross-web-archive infrastructure [3] to crawl Mementos in 22 web archives. We build collections by evaluating the content-wise and temporal relevance of crawled resources and we compare the resulting collections with collections created on the basis of live web crawls and a manually curated Archive-It crawl.…”
Section: Introduction (mentioning)
confidence: 99%
“…In order to optimize this federated search, we first established a cache system. Based on its resulting log data we conducted a number of experiments to investigate the feasibility of simple machine learning classifiers to predict whether a web archive has Mementos of a requested URI-R [1]. The results were encouraging and hence we put the machine learning prediction process into production as well.…”
Section: Introduction (mentioning)
confidence: 99%
“…In case of a cache miss (no entry for the URI-R is found in the cache) the machine learning based process is started. As described in [1], it takes various simple characteristics of the URI-R into account (length, number of tokens, top level domain, etc.) and predicts the archives in which Mementos are available.…”
Section: Introduction (mentioning)
confidence: 99%
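
The last citation statement names three URI-R characteristics (length, number of tokens, top level domain) that feed the prediction. The sketch below is one way such features and a per-archive yes/no decision could look. It is an illustration under assumptions, not the method of the cited paper: the function names, the tokenisation rule (splitting on non-alphanumeric characters), and the classifier callables are all hypothetical.

```python
import re
from typing import Callable, Dict, List
from urllib.parse import urlsplit

Features = Dict[str, object]


def uri_features(uri_r: str) -> Features:
    """Extract simple characteristics of a URI-R.

    The tokenisation rule is an assumption made for this sketch.
    """
    host = urlsplit(uri_r).hostname or ""
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", uri_r) if t]
    return {
        "length": len(uri_r),
        "num_tokens": len(tokens),
        # Last dot-separated label of the host, or "" if there is none.
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
    }


def predict_archives(
    uri_r: str,
    classifiers: Dict[str, Callable[[Features], bool]],
) -> List[str]:
    """Return the archives whose binary classifier answers 'yes'.

    `classifiers` maps a hypothetical archive name to any callable that
    takes the feature dict and returns True or False.
    """
    features = uri_features(uri_r)
    return [name for name, clf in classifiers.items() if clf(features)]
```

With one binary classifier per archive, only the archives predicted to hold Mementos need to be queried on a cache miss, instead of all of them.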