The National Archives and Records Administration (NARA) and EU SHAMAN projects are working with multiple research institutions on tools and technologies that will supply a comprehensive, systematic, and dynamic means for preserving virtually any type of electronic record, free from dependence on any specific hardware or software. This paper describes the joint development work between the University of Liverpool and the San Diego Supercomputer Center (SDSC) at the University of California, San Diego on the NARA and SHAMAN prototypes. The aim is to provide technologies in support of the required generic data management infrastructure. We describe a Theory of Preservation that quantifies how communication can be accomplished when future technologies differ from those available at present, including not only different hardware and software but also different standards for encoding information. We describe the concept of a “digital ontology” to characterize preservation processes; this advances the current OAIS Reference Model approach of providing representation information about records. To realize a comprehensive Theory of Preservation, we describe the ongoing integration of distributed shared collection management technologies with digital library browsing and presentation technologies for the NARA and SHAMAN Persistent Archive Testbeds.
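To make the distinction between OAIS-style representation information and a "digital ontology" of preservation processes more concrete, the following minimal Python sketch is offered as an illustration only; all class and field names are assumptions, not structures defined by the NARA or SHAMAN prototypes.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical, simplified structures for illustration; the actual
# NARA/SHAMAN prototypes define far richer models.

@dataclass
class RepresentationInfo:
    """OAIS-style static description needed to interpret the bits."""
    format_name: str          # e.g. "PDF 1.4"
    encoding: str             # e.g. "UTF-8"
    rendering_software: str   # e.g. "any ISO 32000 compliant viewer"

@dataclass
class PreservationProcess:
    """A digital-ontology entry naming an operation applied to the record."""
    name: str                 # e.g. "format migration"
    parameters: Dict[str, str]

@dataclass
class ArchivedRecord:
    identifier: str
    representation_info: RepresentationInfo
    # The digital ontology goes beyond static representation information by
    # also characterizing the preservation processes the record depends on.
    preservation_processes: List[PreservationProcess] = field(default_factory=list)

record = ArchivedRecord(
    identifier="nara:record/0001",
    representation_info=RepresentationInfo("PDF 1.4", "UTF-8", "ISO 32000 viewer"),
    preservation_processes=[
        PreservationProcess("format migration", {"from": "PDF 1.4", "to": "PDF/A-2"}),
    ],
)
```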
This paper explores the integration of text mining and data mining techniques, digital library systems, and computational and data grid technologies with the objective of developing an online classification service exemplar. We discuss the current research issues relating to the use of data mining algorithms and toolkits for textual data; the necessary changes within the Cheshire3 Information Framework to accommodate analysis workflows; the outcomes of a demonstrator based on the National Library of Medicine's Medline dataset; and the provision of comparable metrics for evaluation purposes. The prototype has resulted in extremely accurate online classification services and offers a novel method of supporting text mining and data mining within a highly scaled computational environment, integrated seamlessly into the digital library architecture.
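As a rough illustration of the kind of classification workflow the abstract describes (not the Cheshire3 implementation itself), the following Python sketch trains a simple bag-of-words classifier with scikit-learn; the sample documents, labels, and pipeline choices are invented for the example.

```python
# Minimal text-classification sketch; documents, labels, and pipeline
# choices are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_texts = [
    "randomized trial of beta blockers in heart failure",
    "gene expression profiling of breast cancer tumours",
    "cognitive behavioural therapy for anxiety disorders",
]
train_labels = ["cardiology", "oncology", "psychiatry"]

# TF-IDF term weighting followed by multinomial naive Bayes: a common
# baseline for assigning subject categories to abstracts.
classifier = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
classifier.fit(train_texts, train_labels)

print(classifier.predict(["myocardial infarction outcomes after stenting"]))
# -> ['cardiology'] for this toy training set
```

In a production setting such a pipeline would be trained on the full Medline collection and exposed as an online service; the toy data above only illustrates the shape of the workflow.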
Abstract. The Multivalent Document Model offers a practical, proven, no-compromises architecture for preserving digital documents of potentially any data format. We have implemented from scratch such complex and currently important formats as PDF and HTML, as well as older formats including scanned paper, UNIX manual pages, TeX DVI, and Apple II AppleWorks word processing. The architecture, stable since its definition in 1997, extends easily to additional document formats, defines a cross-format document tree data structure that fully captures semantics and layout, supports full expression of a format's often idiosyncratic concepts and behavior, enables sharing of functionality across formats thus reducing implementation effort, can introduce new functionality such as hyperlinks and annotation to older formats that cannot express them, and provides a single interface (API) across all formats. Multivalent contrasts sharply with emulation and conversion, and advances Lorie's Universal Virtual Computer with high-level architecture and extensive implementation.
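The following Python sketch illustrates the general idea of a cross-format document tree served through a single API; the class and method names are assumptions for illustration, not the actual Multivalent interfaces (which are implemented in Java).

```python
# Illustrative sketch of a cross-format document tree behind one API;
# names are assumptions, not the real Multivalent classes.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    """A tree node capturing both semantics (name, text) and layout (bbox)."""
    name: str                               # e.g. "page", "paragraph", "span"
    text: Optional[str] = None
    bbox: Optional[Tuple[int, int, int, int]] = None  # x, y, width, height
    children: List["Node"] = field(default_factory=list)

class MediaAdaptor:
    """Per-format parser: each format (PDF, HTML, DVI, ...) builds the same tree."""
    def parse(self, raw_bytes: bytes) -> Node:
        raise NotImplementedError

class ManualPageAdaptor(MediaAdaptor):
    """Toy adaptor for plain-text UNIX manual pages."""
    def parse(self, raw_bytes: bytes) -> Node:
        root = Node("document")
        for line in raw_bytes.decode("utf-8", errors="replace").splitlines():
            root.children.append(Node("line", text=line))
        return root

# Format-independent behaviors (search, annotation, hyperlinks) can then be
# written once against Node, regardless of which adaptor built the tree.
def full_text(node: Node) -> str:
    parts = [node.text] if node.text else []
    parts += [full_text(child) for child in node.children]
    return " ".join(p for p in parts if p)

doc = ManualPageAdaptor().parse(b"NAME\n    ls - list directory contents")
print(full_text(doc))
```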