Recently the Library of Congress began developing a strategy for the preservation of digital content. Efforts have focused on the need to select, harvest, describe, access, and preserve Web resources. This poster focuses on the Library's initial investigation and evaluation of Web harvesting software tools.

WEB HARVESTING TOOLS

While there are a number of tools for Web crawling, only a small subset can handle the more specific and challenging tasks of large-scale harvesting for long-term preservation. From this subset, the Library is examining the following tools based on these criteria: open source, documented prior use, and an active community of developers and users. Our minimum test platform is a 1.8 GHz Pentium 4 with 768 MB of RAM and hundreds of gigabytes of disk space. We run Fedora Linux because of its mainstream usage and active community support. We are benchmarking the various processes in an attempt to estimate their scalability.

HTTrack [1], a desktop crawler, is easy to configure, widely used, and available for both Windows and Unix systems. However, HTTrack is best suited for exploratory acquisition of a small number of sites. It modifies the links in retrieved content to create a self-consistent set of files that can be viewed directly, without the need for a separate viewing tool. HTTrack is valuable for site analysis but not suitable for wide-scale harvesting; a sample invocation appears at the end of this section.

The open source NEDLIB Harvester [2], developed by a European consortium, is used by a number of national libraries. Despite its popularity in this small community, its development has been dormant since September 2002. NEDLIB relies on a relational database (MySQL) for its configuration and process control. While the database adds complexity to its use, it also provides the ability for extensive reporting. In our initial testing, NEDLIB's crawl configuration, performed by adding values for seeds, inclusions, and exclusions to database tables, was sufficiently expressive for general crawls. However, NEDLIB does not have the flexibility required by more complex permissions environments. The harvester lacks a direct user interface and communicates its progress through logging and database entries. Integrating NEDLIB into a regular harvesting workflow will require the development of a superstructure of tools. We are still establishing approaches for measuring crawler performance and quality, but we were satisfied with NEDLIB's results on moderately sized (tens of gigabytes), narrowly scoped crawls.

Heritrix [3], initiated by the Internet Archive [4], was released to the public as recently as January 2004. The Library's initial testing of Heritrix has been promising, and the public nature of the software's development process instills confidence in its future improvement. Heritrix is driven by an XML configuration language, which supports complex crawl definitions and filtering. In addition, it appears to support advanced customization via Java plug-ins. Heritrix includes a Web-hosted control panel for managing and monitoring crawls. Based ...
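To illustrate the kind of small, exploratory acquisition for which HTTrack is suited, the following is a minimal sketch of an HTTrack command-line invocation using its standard options; the site URL, output directory, filter pattern, and depth shown here are placeholders for illustration, not settings used in the Library's tests.

  httrack "http://www.example.gov/" -O ./example-mirror "+*.example.gov/*" -r6 -v

In this sketch, -O names the local directory that will hold the mirrored, link-rewritten files, the "+" filter restricts retrieval to the example.gov domain, -r6 limits the crawl to a link depth of six, and -v enables verbose progress output.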
Preservation of digital content into the future will rely on the ability of institutions to provide robust system infrastructures that leverage distributed and shared services and tools. The academic, nonprofit, and government entities that make up the National Digital Information Infrastructure and Preservation Program (NDIIPP) partner network have been working toward an architecture that can provide reliable, redundant, geographically dispersed copies of their digital content. The NDIIPP program has conducted a set of initiatives that have enabled partners to better understand the requirements for effective collection interchange. The NDIIPP program partnered with the San Diego Supercomputer Center (SDSC) to determine the feasibility of data transmission and storage using best-of-breed technologies inherent to U.S. high-speed research networks and high-performance computing data storage infrastructures. The results of this partnership guided the development of the Library of Congress's cyberinfrastructure and its approach to network data transfer. Other NDIIPP partners, too, are researching a range of network architecture models for data exchange and storage. All of these explorations will build toward the development of best practices for sustainable interoperability and storage solutions.