Routing Memento Requests Using Binary Classifiers

Bornand, Nicolas J.; Balakireva, Lyudmila; Sompel, Herbert Van de

doi:10.1145/2910896.2910899

Cited by 24 publications

(24 citation statements)

References 16 publications


“…Of course, the question is: how do we know where to send the URI lookups? We have explored a variety of methods, including building web archive profiles using their CDX files (produced for playback in the Wayback Machine archival playback software) and HTTP access logs of web archives (Alam et al, 2015; Alam et al, 2016b), machine learning techniques based on training data from responses of web archives for a fixed set of URI lookups (Bornand et al, 2016) (this method is in production in the Time Travel service shown in Figures 14.14 and 14.15), and, for web archives that have a full-text index, extracting terms from archived documents and using those as queries back into the archive (Alam et al, 2016a).…”

Section: Routing URI Requests to the Proper Web Archives (mentioning)

confidence: 99%

Adding the Dimension of Time to HTTP

Nelson

Sompel

2019

The SAGE Handbook of Web History

Self Cite


Introduction. While the web is distributed, most web archives are centralized silos that do not cooperate with each other. This is partially because the technology that is necessary to replay the archived content and keep it from being influenced by material on the live web also makes it difficult for web archives to cooperate. The Memento Protocol (which we played a central role in defining) addresses this problem by defining an extension to the Hypertext Transfer Protocol (HTTP) that allows for standardized, machine-readable integration of both the past web and the present web. The Memento Protocol extends the concept of HTTP content negotiation to include not only well-known dimensions such as Multipurpose Internet Mail Extensions (MIME) types (e.g., JPEG vs. PNG) and file encodings (e.g., gzip vs. compress), but also the dimension of Coordinated Universal Time (UTC) as a universal versioning system. The protocol can be supported by all systems that hold temporal…
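The time dimension described in this abstract can be illustrated with a minimal sketch. In the Memento Protocol (RFC 7089), a client states the desired point in time in an `Accept-Datetime` request header formatted as an HTTP date. The function name `memento_request_headers` and the example date are illustrative assumptions, not taken from the cited chapter, and the sketch only builds the header without sending a request.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request_headers(target: datetime) -> dict[str, str]:
    """Build request headers asking for the version closest to `target`.

    Hypothetical helper for illustration; only the header name comes
    from the Memento Protocol.
    """
    if target.tzinfo is None:
        raise ValueError("target must be timezone-aware")
    # HTTP dates are expressed in GMT, so normalize to UTC first.
    http_date = format_datetime(target.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


# Example date chosen arbitrarily for the sketch.
headers = memento_request_headers(datetime(2015, 1, 1, tzinfo=timezone.utc))
```

A server that supports the protocol would answer such a request with the archived version closest to the requested datetime, in the same way it would pick a MIME type or encoding from the other negotiation headers.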


“…Unlike previous work [9], in order to generate the richest possible event collections, we are interested in obtaining Mementos from as many publicly available web archives around the world as possible. The Memento infrastructure, and in particular the Memento Aggregator [3], makes this possible. For each URI that needs to be crawled (seed URIs and URIs in the priority queue) until the crawler's stop condition is met, the crawler obtains a Memento of that URI that was archived temporally closest but after DTE.…”

Section: Web Archive Crawls (mentioning)

confidence: 99%
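The selection rule in the statement above (the Memento archived temporally closest to, but after, DTE) can be sketched as follows. This is an illustration under stated assumptions, not the authors' crawler: the excerpt does not define DTE, so it is treated here simply as a reference datetime, a Memento archived exactly at DTE is assumed to count, and the function name `closest_after` and the sample datetimes are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def closest_after(candidates: list[datetime], dte: datetime) -> Optional[datetime]:
    """Return the earliest datetime at or after `dte`, or None if there is none."""
    # Assumption: "after DTE" is read as "at or after DTE".
    later = [c for c in candidates if c >= dte]
    return min(later) if later else None


# Toy archival datetimes for one URI, invented for the sketch.
mementos = [
    datetime(2016, 3, 1, tzinfo=timezone.utc),
    datetime(2016, 3, 5, tzinfo=timezone.utc),
    datetime(2016, 4, 2, tzinfo=timezone.utc),
]
dte = datetime(2016, 3, 3, tzinfo=timezone.utc)
chosen = closest_after(mementos, dte)
```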

“…Inspired by previous work by Gossen et al [9], in this paper, we present a framework to build event-specific collections by focused crawling of web archives. We utilize the Memento protocol [20] and the associated cross-web-archive infrastructure [3] to crawl Mementos in 22 web archives. We build collections by evaluating the content-wise and temporal relevance of crawled resources and we compare the resulting collections with collections created on the basis of live web crawls and a manually curated Archive-It crawl.…”

Section: Introduction (mentioning)

confidence: 99%

Focused Crawl of Web Archives to Build Event Collections

Klein

Balakireva

Sompel

2018

Proceedings of the 10th ACM Conference on Web Science

Self Cite


Event collections are frequently built by crawling the live web on the basis of seed URIs nominated by human experts. Focused web crawling is a technique where the crawler is guided by reference content pertaining to the event. Given the dynamic nature of the web and the pace with which topics evolve, the timing of the crawl is a concern for both approaches. We investigate the feasibility of performing focused crawls on the archived web. By utilizing the Memento infrastructure, we obtain resources from 22 web archives that contribute to building event collections. We create collections on four events and compare the relevance of their resources to collections built from crawling the live web as well as from a manually curated collection. Our results show that focused crawling on the archived web can be done and indeed results in highly relevant collections, especially for events that happened further in the past.


“…In order to optimize this federated search, we first established a cache system. Based on its resulting log data we conducted a number of experiments to investigate the feasibility of simple machine learning classifiers to predict whether a web archive has Mementos of a requested URI-R [1]. The results were encouraging and hence we put the machine learning prediction process into production as well.…”

Section: Introduction (mentioning)

confidence: 99%

“…In case of a cache miss (no entry for the URI-R is found in the cache), the machine learning based process is started. As described in [1], it takes various simple characteristics of the URI-R into account (length, number of tokens, top level domain, etc.) and predicts the archives in which Mementos are available.…”

Section: Introduction (mentioning)

confidence: 99%
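The URI-R characteristics named in the statement above (length, number of tokens, top level domain) could be extracted along these lines. This is a minimal sketch, not the feature set used in [1]: the tokenization rule (splitting on non-alphanumeric characters), the function name `uri_features`, and the example URI are assumptions made for illustration, and the per-archive classifiers that would consume these features are not shown.

```python
import re
from urllib.parse import urlsplit


def uri_features(uri_r: str) -> dict[str, object]:
    """Derive a few simple characteristics from a URI-R string."""
    host = urlsplit(uri_r).hostname or ""
    # Assumed tokenization: runs of letters and digits count as tokens.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", uri_r) if t]
    return {
        "length": len(uri_r),
        "token_count": len(tokens),
        # Last dot-separated label of the host, empty if the host has no dot.
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
    }


features = uri_features("http://example.org/news/2016/story.html")
```

In a setup like the one described, one binary classifier per archive would take such a feature record and answer whether that archive is likely to hold Mementos of the URI-R.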

Evaluating Memento Service Optimizations

Klein

Balakireva

Shankar

2019

2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)

Self Cite


Services and applications based on the Memento Aggregator can suffer from slow response times due to the federated search across web archives performed by the Memento infrastructure. In an effort to decrease the response times, we established a cache system and experimented with machine learning models to predict archival holdings. We reported on the experimental results in previous work and can now, after these optimizations have been in production for two years, evaluate their efficiency, based on long-term log data. During our investigation we find that the cache is very effective with a 70-80% cache hit rate for human-driven services. The machine learning prediction operates at an acceptable average recall level of 0.727 but our results also show that a more frequent retraining of the models is needed to further improve prediction accuracy.
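For readers unfamiliar with the recall figure quoted in this abstract, the standard definition is the share of archives that actually hold Mementos and that the prediction also named. The sketch below applies that definition to a single request. The archive labels and the function name `recall` are invented for illustration, and how the paper averages recall across requests or archives is not stated in the abstract.

```python
def recall(predicted: set[str], actual: set[str]) -> float:
    """Fraction of archives that hold Mementos and were also predicted."""
    if not actual:
        raise ValueError("recall is undefined when no archive holds a Memento")
    return len(predicted & actual) / len(actual)


# Toy example: three archives hold the URI-R, the prediction names two of them
# plus one that does not hold it.
predicted_archives = {"archive_a", "archive_b", "archive_d"}
actual_archives = {"archive_a", "archive_b", "archive_c"}
score = recall(predicted_archives, actual_archives)
```

A recall below 1 in this setting means that some archives holding Mementos are never queried, which is the cost traded against the shorter response times the routing is meant to deliver.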

