In this paper, we use a set of controlled experiments to benchmark the cost of executing typical repository functions, such as ingestion, fixity checking, and heavy data processing, in the cloud. We focus on the repository service pattern in which content is explicitly stored away from where it is processed. We measured the processing speed and unit cost of each scenario using a large sensor dataset and Amazon Web Services (AWS). The initial results reveal three distinct cost patterns: 1) spending more buys up to proportionally faster service; 2) more money does not necessarily buy better performance; and 3) spending less can sometimes be faster. Further investigation of these performance and cost patterns will help repositories form more effective operational strategies.
Keywords: Institutional repository; Big data; Cloud computing; Cost analysis.
EXPERIMENT DESIGN
We designed three controlled experiments to execute typical repository tasks in AWS: 1) data ingestion, where the File Information Tool Set (FITS) was used to characterize the data files and create metadata associated with the Fedora objects to be ingested; 2) fixity checking, where new file digests were calculated from the ingested data and then compared with the stored digest values; and 3) heavy data processing, where multiple Fast Fourier Transform (FFT) operations were performed on the ingested sensor data. Illustrative sketches of these three steps appear at the end of this section. To run these experiments, we first installed a Fedora 4-based data repository on an m4.xlarge Elastic Compute Cloud (EC2) instance. This repository instance had a large Elastic Block Store (EBS) volume attached to it, so all data deposited to the repository was considered locally stored. The cost of this instance was not counted toward the execution costs.

The data used for the experiments were vibration signals collected from 214 accelerometers mounted in Virginia Tech's Goodwin Hall [1][2][3][4], an engineering building and a highly instrumented smart-infrastructure laboratory facility. The data were written into one-minute-interval, zlib-compressed, chunked HDF5 files. The experiments used three full days of accelerometer data totaling approximately 223 GB. The data were staged in a temporary holding area in a Simple Storage Service (S3) bucket. We then allocated n EC2 instances of type t2.medium or m4.large, where n = 1, 2, …, 9, to perform the processing. The S3 bucket, EBS volumes, and EC2 nodes were all provisioned in the AWS US East Region, so that data movement among them was fast and free of charge.

Figure 1 shows the speedup results of the three experiments. For the ingestion experiment, a linear speedup was consistently observed when using the faster m4.large instances. This may be attributed to the highly parallelizable workload: because each execution is largely independent of the others in terms of the resources it needs, doubling the resources roughly halves the run time. The situation was markedly different when using the smaller, cheaper instances. A superlinear speedup was observed when n < 5, then drifted to the...
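As a concrete illustration of the ingestion step, the following is a minimal Python sketch that characterizes a file with FITS and deposits it into a Fedora 4 repository over its REST (LDP) interface. The endpoint URL, FITS installation path, and container name are placeholders rather than the values used in the experiments, and the actual ingestion tooling may differ.

```python
import subprocess
import requests

FEDORA_ROOT = "http://repository.example.org:8080/rest"  # placeholder Fedora 4 endpoint
FITS_SH = "/opt/fits/fits.sh"                             # placeholder FITS install path

def ingest(local_path, container="sensor-data"):
    # 1. Run FITS on the file to produce technical metadata (XML on stdout).
    fits_xml = subprocess.run(
        [FITS_SH, "-i", local_path], capture_output=True, text=True, check=True
    ).stdout

    # 2. Deposit the binary as a new child resource of the container.
    with open(local_path, "rb") as fh:
        resp = requests.post(
            f"{FEDORA_ROOT}/{container}",
            data=fh,
            headers={"Content-Type": "application/octet-stream"},
        )
    resp.raise_for_status()
    binary_uri = resp.headers["Location"]

    # 3. Attach a note to the binary's description node via SPARQL-Update.
    sparql = (
        'INSERT DATA { <> <http://purl.org/dc/terms/description> '
        '"Characterized with FITS" . }'
    )
    requests.patch(
        f"{binary_uri}/fcr:metadata",
        data=sparql,
        headers={"Content-Type": "application/sparql-update"},
    ).raise_for_status()

    return binary_uri, fits_xml
```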
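The fixity-checking step amounts to recomputing each file's digest and comparing it with the value recorded at ingest. A minimal sketch follows; the digest algorithm (SHA-1 here) and the way recorded digests are retrieved are assumptions, since the excerpt does not specify them.

```python
import hashlib

def compute_digest(path, algorithm="sha1", chunk_size=1 << 20):
    # Stream the file in 1 MB chunks so large HDF5 files are never fully in memory.
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def check_fixity(path, recorded_digest, algorithm="sha1"):
    # A file passes fixity checking when the fresh digest matches the recorded one.
    return compute_digest(path, algorithm) == recorded_digest
```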
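For the heavy-processing step, each one-minute HDF5 chunk can be read and transformed channel by channel. The sketch below assumes an HDF5 layout with a single two-dimensional dataset (samples by channels); the actual dataset names and shapes in the Goodwin Hall files are not given in the excerpt.

```python
import h5py
import numpy as np

def fft_per_channel(hdf5_path, dataset_name="acceleration"):
    # "acceleration" is a hypothetical dataset name; shape assumed (samples, channels).
    with h5py.File(hdf5_path, "r") as f:
        signal = f[dataset_name][...]
    # Real-input FFT along the time axis yields one spectrum per accelerometer channel.
    spectra = np.fft.rfft(signal, axis=0)
    return np.abs(spectra)
```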
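For reading Figure 1, we take speedup to be the usual ratio of single-instance to n-instance wall-clock time; this definition is assumed here rather than stated in the excerpt:

$$ S(n) = \frac{T(1)}{T(n)}, \qquad S(n) \approx n \ \text{(linear)}, \qquad S(n) > n \ \text{(superlinear)}. $$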