2014
DOI: 10.1007/s00799-014-0118-y

Profiling web archive coverage for top-level domain and content language

Abstract: The Memento aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of only sending queries to archives likely to hold the archived page. We profile twelve public web archives using data from a variety of sources (t…
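The query-routing idea described in the abstract can be illustrated with a small sketch: instead of polling every archive, an aggregator consults per-archive coverage profiles and forwards a lookup only to archives whose profile (here, simply the top-level domain) suggests they may hold the page. This is a minimal illustration, not the paper's implementation; the archive list and profile rules below are simplified stand-ins, not the profiles built in the paper.

```python
# Minimal sketch (not the paper's code) of profile-based query routing for a
# Memento aggregator: forward a TimeMap lookup only to archives whose profile
# suggests coverage of the URI's top-level domain.
from urllib.parse import urlparse

# Simplified profiles: which TLDs each archive is assumed to cover.
# An empty set stands for broad coverage ("always query").
ARCHIVE_PROFILES = {
    "archive.org": set(),            # broad crawls, query for every URI
    "webarchive.org.uk": {"uk"},     # national archive, UK domains only
    "vefsafn.is": {"is"},            # national archive, Icelandic domains
}

def candidate_archives(uri: str) -> list[str]:
    """Return the archives worth polling for this URI."""
    host = urlparse(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return [
        archive
        for archive, tlds in ARCHIVE_PROFILES.items()
        if not tlds or tld in tlds   # broad archive or matching TLD
    ]

print(candidate_archives("http://www.bbc.co.uk/news"))
# ['archive.org', 'webarchive.org.uk']
```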

Cited by 39 publications (18 citation statements)
References 35 publications
“…First, they are all natively compliant with the Memento protocol making the lookup of a million URIs both feasible and scalable. Second, consistent with the findings in [46] , we observed during test runs that the vast majority of Mementos was provided by the pool of these six web archives and therefore felt that the marginal return on investment that would result from polling additional archives did not justify the additional resources required.…”
Section: Methods (supporting)
confidence: 72%
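The six-archive pool mentioned in the statement above is practical because Memento-compliant archives expose TimeMaps through a standard link-format endpoint (RFC 7089). The sketch below polls a small pool of such endpoints and counts the mementos each returns; only the Internet Archive's TimeMap URL pattern is shown, the other endpoints are left as placeholders, and the memento counting is approximate, so treat this as an assumption-laden illustration rather than the cited study's tooling.

```python
# Sketch: poll a pool of Memento-compliant archives for TimeMaps and count the
# mementos each contributes. Only the Internet Archive TimeMap URL pattern is
# listed; other archives publish their own endpoints (assumption: the pool
# below stands in for the six archives mentioned in the quoted study).
import urllib.request

TIMEMAP_ENDPOINTS = [
    "http://web.archive.org/web/timemap/link/{uri}",   # Internet Archive
    # ... the other Memento-compliant archives in the pool would go here
]

def count_mementos(uri: str) -> dict[str, int]:
    counts = {}
    for template in TIMEMAP_ENDPOINTS:
        timemap_url = template.format(uri=uri)
        try:
            with urllib.request.urlopen(timemap_url, timeout=30) as resp:
                body = resp.read().decode("utf-8", errors="replace")
            # TimeMaps are serialized in application/link-format (RFC 7089);
            # each snapshot is a link with rel="memento" (this count skips the
            # "first memento"/"last memento" variants for brevity).
            counts[timemap_url] = body.count('rel="memento"')
        except OSError:
            counts[timemap_url] = 0    # archive unreachable or URI not held
    return counts

print(count_mementos("http://www.cnn.com/"))
```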
“…As argued by Dougherty and Meyer [21], often the specific ontological and epistemological assumptions made during the collection and curation of web archives are either not made explicit to potential users or are seen as an impediment to their use and/or re-use [21]. Collection decisions, whether thematic or broad, domain-based harvesting [57], the temporal dimensions of when to capture and for how long [46] and issues over geographic [72] and language [1] coverage and representation of the global Web(s), have all highlighted methodological concerns over provenance, the subjectivity of records, the lack of transparency and metadata for harvester algorithms and problems with the generalisability of potential research findings based on web archives.…”
Section: Problematising Web Archival Practices (mentioning)
confidence: 99%
“…In our earliest research in URI routing (AlSum et al., 2013, 2014), we proved that 52% of the time we can still produce a complete TimeMap querying only the top three (of a possible 12 available at that time) web archives even if we exclude the Internet Archive. Querying the top six web archives (excluding the Internet Archive) produces a complete TimeMap 77% of the time.…”
Section: Routing URI Requests To the Proper Web Archives (mentioning)
confidence: 97%
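The 52% and 77% figures quoted above come from measuring, for a sample of URIs, whether the mementos held by only the top-k archives already match the full aggregated TimeMap. A toy version of that measurement, with holdings data invented solely to demonstrate the computation, might look like this:

```python
# Toy completeness measurement: the fraction of URIs whose full set of holding
# archives (Internet Archive excluded) is already covered by the k archives
# that hold mementos for the most URIs. Holdings data is invented.
from collections import Counter

# holdings[uri] = archives that returned at least one memento for that URI
holdings = {
    "http://example.org/a": {"ia", "arc1", "arc2"},
    "http://example.org/b": {"ia", "arc1"},
    "http://example.org/c": {"arc2", "arc3"},
}

def completeness_at_k(holdings: dict[str, set[str]], k: int,
                      exclude: frozenset = frozenset({"ia"})) -> float:
    # Rank archives by how many URIs they hold, ignoring the excluded ones.
    freq = Counter(a for archives in holdings.values()
                   for a in archives if a not in exclude)
    top_k = {archive for archive, _ in freq.most_common(k)}
    # A URI's TimeMap is "complete" if every non-excluded holding archive
    # is among the top-k archives queried.
    complete = sum(1 for archives in holdings.values()
                   if (archives - exclude) <= top_k)
    return complete / len(holdings)

print(completeness_at_k(holdings, k=2))   # 0.666...: 2 of 3 TimeMaps complete
```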