Web archives are a valuable resource for researchers in various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for the extraction and derivation of smaller datasets. Besides efficient access, we identify five other objectives based on practical researcher needs, such as ease of use, extensibility, and reusability. Towards these objectives, we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores, while improving usability by seamlessly integrating queries and derivations with external tools.
Standard ad hoc routing protocols do not work in intermittently connected networks, since end-to-end paths may not exist in such networks. A store-and-forward approach [7] has been proposed for them. Because the nodes in such networks move around, the proposed delay-tolerant network (DTN) architecture [7] needs to be enhanced with a mobility management scheme, so that nodes wishing to correspond with mobile hosts have a way of determining their whereabouts. A mobile host may move a short distance and remain within the vicinity of a DTN name registrar (DNR) (one communication link away), or it may move far away (multiple communication links away). In this paper, we present the mobility management scheme we propose for DTN environments. In addition, we provide simple analytical formulae to evaluate the latency of performing location updates and the useful utilization available to each node for data transfer, assuming that the communication links between nodes are periodically available for a short period of time. Our simple analytical model allows us to draw insights into the impact of near and far movements on the useful utilization.