2014
DOI: 10.1007/s00799-014-0111-5

Who and what links to the Internet Archive

Abstract: The Internet Archive's (IA) Wayback Machine is the largest and oldest public web archive and has become a significant repository of our recent history and cultural heritage. Despite its importance, there has been little research about how it is discovered and used. Based on web access logs, we analyze what users are looking for, why they come to IA, where they come from, and how pages link to IA. We find that users request English pages the most, followed by the European languages. Most human users come to web…

Cited by 30 publications (14 citation statements)
References 36 publications
“…However, in this work, we assume that the requested web page is neither available in web archives nor on the live web and thus is considered to be a "lost" web page. This assumption reflects previous work showing that users often search web archives when they cannot find the desired web page on the live web [5] and that there are a significant number of web pages that are not archived [1,3]. Learning about a requested web page without examining the content of the page can be challenging due to little context and content available.…”
Section: Introduction (mentioning)
confidence: 56%
“…The stop words were based on a set of stop words in the Natural Language Toolkit (NLTK). We found that using the all-grams from the URI after removing the TLD and numbers had the highest F1 score, which was comparable to results obtained in related work [25].…”
Section: Step 1: First-level Classification (mentioning)
confidence: 99%
“…Archival materials were obtained from the Internet Archive (IA). The Internet Archive's Wayback Machine is the largest and oldest public Web archive and has become a significant repository of our recent history and cultural heritage [22]. Internet archive resources are provided free of charge to historians, researchers, and for education purposes.…”
Section: Methods (mentioning)
confidence: 99%
“…This search analysis was conducted on the general use of the Portuguese Web Archive (Costa and Silva, 2011). Another study on the Wayback Machine indicates that its users mostly request English pages, followed by pages written in European languages (AlNoamany et al., 2014). Most of these users probably search for pages archived in the Wayback Machine because the requested pages no longer exist on the live Web.…”
Section: Why What and How Do Users Search? (mentioning)
confidence: 99%