Recently the Library of Congress began developing a strategy for the preservation of digital content. Efforts have focused on the need to select, harvest, describe, access and preserve Web resources. This poster focuses on the Library's initial investigation and evaluation of Web harvesting software tools.
WEB HARVESTING TOOLS
While there are a number of tools for Web crawling, only a small subset can handle the more specific and challenging tasks of large-scale harvesting for long-term preservation. From this subset, the Library is examining the following tools based on these criteria: open source licensing, documented prior use, and an active community of developers and users.
Our minimum test platform is a 1.8 GHz Pentium 4 with 768 MB of RAM and hundreds of gigabytes of disk space. We run Fedora Linux because of its mainstream usage and active community support. We are benchmarking the various processes in an attempt to estimate their scalability.
HTTrack [1], a desktop crawler, is easy to configure, widely used, and available for both Windows and Unix systems. However, HTTrack is best suited for exploratory acquisition of a small number of sites. It modifies the links in retrieved content to create a self-consistent set of files that can be viewed directly without a separate viewing tool. HTTrack is valuable for site analysis but not suitable for wide-scale harvesting.
The open source NEDLIB Harvester [2], developed by a European consortium, is used by a number of national libraries. Despite its popularity in this small community, its development has been dormant since September 2002. NEDLIB relies on a relational database (MySQL) for its configuration and process control. While the database adds complexity to its use, it also enables extensive reporting. In our initial testing, NEDLIB's crawl configuration, performed by adding values for seeds, inclusions, and exclusions to database tables, was sufficiently expressive for general crawls. However, NEDLIB does not have the flexibility required by more complex permissions environments. The harvester lacks a direct user interface and communicates its progress through logging and database entries. Integrating NEDLIB into a regular harvesting workflow will require the development of a superstructure of tools. We are still establishing approaches for measuring crawler performance and quality, but we were satisfied with NEDLIB's results on moderately sized (tens of gigabytes), narrowly scoped crawls.
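To make the database-driven style of configuration concrete, the sketch below shows how seed URLs and scope rules might be loaded into MySQL tables from a Python script. The table names, column names, URLs, and connection parameters are hypothetical placeholders chosen for illustration, not NEDLIB's actual schema; the sketch only illustrates the kind of workflow tooling described above.

# Minimal sketch: loading a crawl definition into NEDLIB-style MySQL tables.
# NOTE: the table names (seeds, inclusions, exclusions), column names, URLs,
# and connection details are hypothetical placeholders, not the NEDLIB schema.
import MySQLdb  # standard Python MySQL driver

conn = MySQLdb.connect(host="localhost", user="harvester",
                       passwd="secret", db="nedlib")
cur = conn.cursor()

# Seed URLs where the crawl starts.
for url in ("http://www.example.gov/", "http://www.example.org/"):
    cur.execute("INSERT INTO seeds (url) VALUES (%s)", (url,))

# Inclusion patterns keep the crawl within scope ...
cur.execute("INSERT INTO inclusions (pattern) VALUES (%s)",
            ("http://www.example.gov/%",))

# ... and exclusion patterns filter out unwanted content.
cur.execute("INSERT INTO exclusions (pattern) VALUES (%s)",
            ("%.cgi%",))

conn.commit()
cur.close()
conn.close()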
Heritrix [3], initiated by the Internet Archive [4], was released to the public as recently as January 2004. The Library's initial testing of Heritrix has been promising, and the public nature of the software's development process instills confidence in its future improvement. Heritrix is driven by an XML configuration language, which supports complex crawl definitions and filtering. In addition, it appears to support advanced customization via Java plug-ins. Heritrix includes a Web-hosted control panel for managing and monitoring crawls. Based ...