In this paper, we use a set of controlled experiments to benchmark the cost of executing typical repository functions, such as ingestion, fixity checking, and heavy data processing, in the cloud. We focus on the repository service pattern in which content is explicitly stored away from where it is processed. We measured the processing speed and unit cost of each scenario using a large sensor dataset and Amazon Web Services (AWS). The initial results reveal three distinct cost patterns: 1) spending more buys up to proportionally faster service; 2) more money does not necessarily buy better performance; and 3) spending less can sometimes be faster. Further investigation of these performance and cost patterns will help repositories form more effective operational strategies.
Keywords: Institutional repository; Big data; Cloud computing; Cost analysis.
EXPERIMENT DESIGN
We designed three controlled experiments to execute typical repository tasks in AWS: 1) data ingestion, where the File Information Tool Set (FITS) was used to characterize the data files and create metadata associated with the Fedora objects to be ingested; 2) fixity checking, where new file digests were calculated from the ingested data and then compared with the stored digest values; and 3) heavy data processing, where multiple Fast Fourier Transform (FFT) operations were performed on the ingested sensor data. Illustrative sketches of these three steps appear at the end of this section. To run these experiments, we first installed a Fedora 4-based data repository on an m4.xlarge Elastic Compute Cloud (EC2) instance. This repository instance had a large Elastic Block Store (EBS) volume attached to it, so all data deposited to the repository was considered locally stored. The cost of this instance was not counted toward the execution costs.

The data used for the experiments were vibration signals collected from 214 accelerometers mounted in Virginia Tech's Goodwin Hall [1][2][3][4], an engineering building and a highly instrumented smart-infrastructure laboratory facility. The data were written into one-minute-interval, zlib-compressed, chunked HDF5 files. The experiments used three full days of accelerometer data totaling approximately 223 GB. The data were staged in a temporary holding area in a Simple Storage Service (S3) bucket. We then allocated n EC2 instances of type t2.medium or m4.large, where n = 1, 2, …, 9, to perform the processing. The S3 bucket, EBS volumes, and EC2 nodes were all provisioned in the AWS US East Region, so that data movement among them was fast and free of charge.

Figure 1 shows the speedup results of the three experiments. For the ingestion experiment, a linear speedup was consistently observed when using the faster m4.large instances. This may be attributed to the highly parallelizable workload: because each execution is largely independent of the others in terms of the resources it needs, doubling the resources roughly halves the run time. The situation was markedly different when using the smaller, cheaper instances. A superlinear speedup was observed when n < 5, then drifted to the...
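As a concrete illustration of the ingestion step, the following is a minimal Python sketch that characterizes a file with FITS and deposits it into a Fedora 4 repository over its REST (LDP) interface. The endpoint URL, FITS installation path, and container name are placeholders rather than the values used in the experiments, and the actual ingestion tooling may differ.

```python
import subprocess
import requests

FEDORA_ROOT = "http://repository.example.org:8080/rest"  # placeholder Fedora 4 endpoint
FITS_SH = "/opt/fits/fits.sh"                             # placeholder FITS install path

def ingest(local_path, container="sensor-data"):
    # 1. Run FITS on the file to produce technical metadata (XML on stdout).
    fits_xml = subprocess.run(
        [FITS_SH, "-i", local_path], capture_output=True, text=True, check=True
    ).stdout

    # 2. Deposit the binary as a new child resource of the container.
    with open(local_path, "rb") as fh:
        resp = requests.post(
            f"{FEDORA_ROOT}/{container}",
            data=fh,
            headers={"Content-Type": "application/octet-stream"},
        )
    resp.raise_for_status()
    binary_uri = resp.headers["Location"]

    # 3. Attach a note to the binary's description node via SPARQL-Update.
    sparql = (
        'INSERT DATA { <> <http://purl.org/dc/terms/description> '
        '"Characterized with FITS" . }'
    )
    requests.patch(
        f"{binary_uri}/fcr:metadata",
        data=sparql,
        headers={"Content-Type": "application/sparql-update"},
    ).raise_for_status()

    return binary_uri, fits_xml
```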
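The fixity-checking step amounts to recomputing each file's digest and comparing it with the value recorded at ingest. A minimal sketch follows; the digest algorithm (SHA-1 here) and the way recorded digests are retrieved are assumptions, since the excerpt does not specify them.

```python
import hashlib

def compute_digest(path, algorithm="sha1", chunk_size=1 << 20):
    # Stream the file in 1 MB chunks so large HDF5 files are never fully in memory.
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def check_fixity(path, recorded_digest, algorithm="sha1"):
    # A file passes fixity checking when the fresh digest matches the recorded one.
    return compute_digest(path, algorithm) == recorded_digest
```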
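For the heavy-processing step, each one-minute HDF5 chunk can be read and transformed channel by channel. The sketch below assumes an HDF5 layout with a single two-dimensional dataset (samples by channels); the actual dataset names and shapes in the Goodwin Hall files are not given in the excerpt.

```python
import h5py
import numpy as np

def fft_per_channel(hdf5_path, dataset_name="acceleration"):
    # "acceleration" is a hypothetical dataset name; shape assumed (samples, channels).
    with h5py.File(hdf5_path, "r") as f:
        signal = f[dataset_name][...]
    # Real-input FFT along the time axis yields one spectrum per accelerometer channel.
    spectra = np.fft.rfft(signal, axis=0)
    return np.abs(spectra)
```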
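For reading Figure 1, we take speedup to be the usual ratio of single-instance to n-instance wall-clock time; this definition is assumed here rather than stated in the excerpt:

$$ S(n) = \frac{T(1)}{T(n)}, \qquad S(n) \approx n \ \text{(linear)}, \qquad S(n) > n \ \text{(superlinear)}. $$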