2014
DOI: 10.1007/s00799-014-0118-y

Profiling web archive coverage for top-level domain and content language

Abstract: The Memento aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of only sending queries to archives likely to hold the archived page. We profile twelve public web archives using data from a variety of sources (t…
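The query-routing idea described in the abstract can be illustrated with a small sketch: instead of polling every archive, an aggregator consults per-archive coverage profiles and forwards a lookup only to archives whose profile (here, simply the top-level domain) suggests they may hold the page. This is a minimal illustration, not the paper's implementation; the archive list and profile rules below are simplified stand-ins, not the profiles built in the paper.

```python
# Minimal sketch (not the paper's code) of profile-based query routing for a
# Memento aggregator: forward a TimeMap lookup only to archives whose profile
# suggests coverage of the URI's top-level domain.
from urllib.parse import urlparse

# Simplified profiles: which TLDs each archive is assumed to cover.
# An empty set stands for broad coverage ("always query").
ARCHIVE_PROFILES = {
    "archive.org": set(),            # broad crawls, query for every URI
    "webarchive.org.uk": {"uk"},     # national archive, UK domains only
    "vefsafn.is": {"is"},            # national archive, Icelandic domains
}

def candidate_archives(uri: str) -> list[str]:
    """Return the archives worth polling for this URI."""
    host = urlparse(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return [
        archive
        for archive, tlds in ARCHIVE_PROFILES.items()
        if not tlds or tld in tlds   # broad archive or matching TLD
    ]

print(candidate_archives("http://www.bbc.co.uk/news"))
# ['archive.org', 'webarchive.org.uk']
```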

Cited by 39 publications (18 citation statements)
References 35 publications
“…First, they are all natively compliant with the Memento protocol making the lookup of a million URIs both feasible and scalable. Second, consistent with the findings in [46] , we observed during test runs that the vast majority of Mementos was provided by the pool of these six web archives and therefore felt that the marginal return on investment that would result from polling additional archives did not justify the additional resources required.…”
Section: Methods (supporting)
confidence: 72%
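The six-archive pool mentioned in the statement above is practical because Memento-compliant archives expose TimeMaps through a standard link-format endpoint (RFC 7089). The sketch below polls a small pool of such endpoints and counts the mementos each returns; only the Internet Archive's TimeMap URL pattern is shown, the other endpoints are left as placeholders, and the memento counting is approximate, so treat this as an assumption-laden illustration rather than the cited study's tooling.

```python
# Sketch: poll a pool of Memento-compliant archives for TimeMaps and count the
# mementos each contributes. Only the Internet Archive TimeMap URL pattern is
# listed; other archives publish their own endpoints (assumption: the pool
# below stands in for the six archives mentioned in the quoted study).
import urllib.request

TIMEMAP_ENDPOINTS = [
    "http://web.archive.org/web/timemap/link/{uri}",   # Internet Archive
    # ... the other Memento-compliant archives in the pool would go here
]

def count_mementos(uri: str) -> dict[str, int]:
    counts = {}
    for template in TIMEMAP_ENDPOINTS:
        timemap_url = template.format(uri=uri)
        try:
            with urllib.request.urlopen(timemap_url, timeout=30) as resp:
                body = resp.read().decode("utf-8", errors="replace")
            # TimeMaps are serialized in application/link-format (RFC 7089);
            # each snapshot is a link with rel="memento" (this count skips the
            # "first memento"/"last memento" variants for brevity).
            counts[timemap_url] = body.count('rel="memento"')
        except OSError:
            counts[timemap_url] = 0    # archive unreachable or URI not held
    return counts

print(count_mementos("http://www.cnn.com/"))
```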
“…As argued by Dougherty and Meyer [21], often the specific ontological and epistemological assumptions made during the collection and curation of web archives are either not made explicit to potential users or are seen as an impediment to their use and/or re-use [21]. Collection decisions, whether thematic or broad, domain-based harvesting [57], the temporal dimensions of when to capture and for how long [46] and issues over geographic [72] and language [1] coverage and representation of the global Web(s), have all highlighted methodological concerns over provenance, the subjectivity of records, the lack of transparency and metadata for harvester algorithms and problems with the generalisability of potential research findings based on web archives.…”
Section: Problematising Web Archival Practices (mentioning)
confidence: 99%
“…In our earliest research in URI routing (AlSum et al., 2013, 2014), we proved that 52% of the time we can still produce a complete TimeMap querying only the top three (of a possible 12 available at that time) web archives even if we exclude the Internet Archive. Querying the top six web archives (excluding the Internet Archive) produces a complete TimeMap 77% of the time.…”
Section: Routing URI Requests To the Proper Web Archives (mentioning)
confidence: 97%
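The 52% and 77% figures quoted above come from measuring, for a sample of URIs, whether the mementos held by only the top-k archives already match the full aggregated TimeMap. A toy version of that measurement, with holdings data invented solely to demonstrate the computation, might look like this:

```python
# Toy completeness measurement: the fraction of URIs whose full set of holding
# archives (Internet Archive excluded) is already covered by the k archives
# that hold mementos for the most URIs. Holdings data is invented.
from collections import Counter

# holdings[uri] = archives that returned at least one memento for that URI
holdings = {
    "http://example.org/a": {"ia", "arc1", "arc2"},
    "http://example.org/b": {"ia", "arc1"},
    "http://example.org/c": {"arc2", "arc3"},
}

def completeness_at_k(holdings: dict[str, set[str]], k: int,
                      exclude: frozenset = frozenset({"ia"})) -> float:
    # Rank archives by how many URIs they hold, ignoring the excluded ones.
    freq = Counter(a for archives in holdings.values()
                   for a in archives if a not in exclude)
    top_k = {archive for archive, _ in freq.most_common(k)}
    # A URI's TimeMap is "complete" if every non-excluded holding archive
    # is among the top-k archives queried.
    complete = sum(1 for archives in holdings.values()
                   if (archives - exclude) <= top_k)
    return complete / len(holdings)

print(completeness_at_k(holdings, k=2))   # 0.666...: 2 of 3 TimeMaps complete
```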