Abstract. Collections of Web-based resources are often decentralized, leaving the task of identifying and locating removed resources to collection managers, who must rely on HTTP response codes. When a resource is no longer available, the server is supposed to return a 404 error code. In practice, to be friendlier to human readers, many servers respond with a 200 OK code and indicate in the text of the response that the document is no longer available. In the reported study, 3.41% of servers responded in this manner. To help collection managers identify these "friendly" or "soft" 404s, we developed two methods that use a Naïve Bayes classifier based on known valid responses and known 404 responses. The classifier was able to predict soft 404 pages with a precision of 99% and a recall of 92%. We also elaborate on the results obtained from our study and detail the lessons learned.
Keywords: Soft 404, Web resource management, distributed collections.
Introduction
Vannevar Bush, in his pioneering 1945 essay "As We May Think" [1], envisions a time in which the world's knowledge is accessible by machine and in which the connections that describe the higher-level relationships among sources are themselves objects of scholarship that can be shared with colleagues. We can see this today on the Web, with the utility of resource lists such as Yahoo and the investigation of mechanisms such as our own Walden's Paths [2,3]. Such interconnections of documents are a natural side effect of collaboration and cooperation, so as the problems to be solved grow beyond the technical abilities of an individual scholar and as social media becomes more embedded into our work practices, the presence of resources that situate knowledge into the broader environment will become ever more prevalent. A factor not considered by Bush but critical in today's networked world is that of administrative ownership of data.
Information today is not contained in neatly defined book-like units that can be replicated and stored locally in libraries. Instead, the administrative control of information related to a topic may be spread across digital collections maintained by multiple scholars in multiple institutions. Administrative decentralization often is a critical factor in engaging a scholar to put in the work needed to create a valuable resource: the sense of ownership and control is motivating and often a necessary condition both for the scholar and for the institution. Some of this need also centers on the desire to have a canonical copy of the resource, since multiple copies in multiple locations can, and often do, diverge over time.
L. Meneses, R. Furuta, and F. Shipman
Administrative decentralization, though, leads to changes that are unexpected by the maintainer of a "meta-resource", a resource created by tying together the existing resources. Individual collections can change in many ways, both intentional and unintentional. Change may be because of deliberate actions on the part of the collector, for example, reorganization of the structure of the collect...
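
The abstract above describes flagging soft 404s with a Naïve Bayes classifier trained on known valid responses and known 404 responses. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes a multinomial Naïve Bayes model over word counts with add-one smoothing, and the class name `NaiveBayesTextClassifier`, the labels `"valid"` and `"soft404"`, the helper `precision_recall`, and the toy training texts are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every token seen during training

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def precision_recall(predicted, actual, positive="soft404"):
    """Precision and recall for the positive label over paired label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy training data (invented for illustration, not from the study).
clf = NaiveBayesTextClassifier()
clf.train("Conference program and call for papers", "valid")
clf.train("Welcome to the digital library collection home page", "valid")
clf.train("Sorry, the page you requested could not be found", "soft404")
clf.train("The document is no longer available on this server", "soft404")

print(clf.predict("Sorry, this page is no longer available"))
```

In such a setup the classifier would be applied to the body of any response that arrives with a 200 OK status, and `precision_recall` would be computed against a hand-labeled sample; the features and training data used in the study itself are not specified in this excerpt.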
It is not unusual for documents on the Web to degrade and suffer from problems associated with unexpected change. In an analysis of the Association for Computing Machinery conference list, we found that categorizing the degree of change affecting digital documents over time is a difficult task. More specifically, we found that categorizing this degree of change is not a binary problem where documents are either unchanged or they have changed so dramatically that they do not fit within the scope of the collection. It is, in part, a characterization of the intent of the change. In this paper, we present a case study that compares change detection methods based on machine learning algorithms against the assessment made by human subjects in a user study. Consequently, this paper focuses on two research questions. First, how can we categorize the various degrees of change that documents endure? And second, how did our automatic detection methods fare against the human assessment of change in the ACM conference list?
It is not unusual for digital collections to degrade and suffer from problems associated with unexpected change. In an analysis of the ACM conference list, we found that categorizing the degree of change affecting a digital collection over time is a difficult task. More specifically, we found that categorizing this degree of change is not a binary problem where documents are either unchanged or they have changed so dramatically that they do not fit within the scope of the collection. It is, in part, a characterization of the intent of the change. In this work, we examine and categorize the various degrees of change that digital documents endure within the boundaries of an institutionally managed repository.
Geographic Information Systems (GIS) have been studied for a long time, but they are now more widely adopted and used in almost every field. When developing a GIS, practitioners usually opt for commercial or paid tools; however, free tools exist that make it possible to build the same GIS applications at no cost. This article discusses GIS and presents the best-known free tools for developing this type of application. Professionally, the community that develops such applications will gain a guideline for choosing the best tool for each of their projects. Finally, the article discusses various proprietary and free tools for developing GIS, giving priority to the use of free tools; the tools covered include ArcGIS, GeoMedia, GrassGIS, gvSIG, QuantumGIS, etc.