2012
DOI: 10.1007/978-3-642-33290-6_22

Identifying “Soft 404” Error Pages: Analyzing the Lexical Signatures of Documents in Distributed Collections

Abstract: Collections of Web-based resources are often decentralized, leaving the task of identifying and locating removed resources to collection managers, who must rely on HTTP response codes. When a resource is no longer available, the server is supposed to return a 404 error code. In practice, to be friendlier to human readers, many servers respond with a 200 OK code and indicate in the text of the response that the document is no longer available. In the reported study, 3.41% of servers respond in this …
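The behavior described in the abstract (a 200 OK status paired with body text saying the document is gone) can be illustrated with a minimal sketch. This is not the lexical-signature method of the paper; it is a simple phrase-matching heuristic, and the function name `looks_like_soft_404` and the phrase list `ERROR_PHRASES` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical phrase list (an assumption, not taken from the paper);
# a real collection would need phrases tuned to its own servers.
ERROR_PHRASES = (
    "not found",
    "no longer available",
    "has been removed",
    "page does not exist",
)


def looks_like_soft_404(status_code: int, body: str) -> bool:
    """Return True when a 200 OK response reads like an error page."""
    # A real 404 (or any other non-200 code) is not a *soft* 404.
    if status_code != 200:
        return False
    # Crudely strip HTML tags, collapse whitespace, and lowercase the text.
    text = re.sub(r"<[^>]+>", " ", body)
    text = re.sub(r"\s+", " ", text).lower()
    # Flag the page if any known error phrase appears in the visible text.
    return any(phrase in text for phrase in ERROR_PHRASES)
```

A heuristic like this will misfire on legitimate pages that happen to discuss missing documents, which is one reason a content-signature approach may be preferable in practice.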

citations
Cited by 15 publications
(8 citation statements)
references
References 26 publications
“…A number of these categories were purposely left out of the evaluation of the classification algorithms as these cases can be handled by previous work. More specifically, detecting "blank pages", "failed redirects", "directory listings", "domain for sale" and "error pages" are handled with previous work on identifying Soft 404 error pages [6,28].…”
Section: Discussion (mentioning)
confidence: 99%
“…More so, Web documents are not static resources and a certain degree of change is expected from them [4]. Our current efforts continue long-standing study of the problems that surface when managing distributed collections and curating missing resources [5][6][7].…”
Section: Introduction (mentioning)
confidence: 99%
“…There are a number of categories that result when no content is available depending on how the servers are configured: blank pages, failed redirects, some directory listings, error pages, and university/institutional pages. In some of these cases, these pages can be detected with previous work on identifying Soft 404 error pages [4,5]. The remaining pages are perhaps the most problematic, when the web address has been taken over and is either for sale or being used for other purposes.…”
Section: Domain For Sale Pages (mentioning)
confidence: 94%
“…Bar-Yossef et al [4] introduced the term "Soft 404s" to identify Web pages that report a status code other than HTTP 404 despite the page not existing. Meneses et al [10] described the process of identifying "Soft 404s" based on a signature of the page's contents. In this work we describe "soft 3XXs" where content is returned from an archive with a status code of 200 yet the contents of the capture consist of an archived HTTP 3XX redirect.…”
Section: Related Work (mentioning)
confidence: 99%