2012 International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/sc.2012.14

A study on data deduplication in HPC storage systems

Abstract: Deduplication is a storage saving technique that is highly successful in enterprise backup environments. On a file system, a single data block might be stored multiple times across different files, for example, multiple versions of a file might exist that are mostly identical. With deduplication, this data replication is localized and redundancy is removed – by storing data just once, all files that use identical regions refer to the same unique data. The most common approach splits file data into chunks and …
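The chunk-and-reference scheme the abstract describes can be illustrated with a minimal sketch. The snippet below splits files into fixed-size chunks, fingerprints each chunk, and stores a chunk only on its first occurrence; later files containing the same chunk simply reference the stored copy. The 8 KiB chunk size, SHA-256 fingerprint, and in-memory dictionary are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of chunk-based deduplication (illustrative, not the
# paper's implementation): identical chunks are stored exactly once.
import hashlib

CHUNK_SIZE = 8 * 1024          # fixed chunk size (assumed for illustration)
chunk_store = {}               # fingerprint -> chunk data, stored once

def store_file(path):
    """Return the file's recipe: the ordered list of chunk fingerprints."""
    recipe = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            fp = hashlib.sha256(chunk).hexdigest()
            # First occurrence stores the chunk; repeats only add a reference.
            chunk_store.setdefault(fp, chunk)
            recipe.append(fp)
    return recipe

def restore_file(recipe, path):
    """Rebuild a file from its recipe and the shared chunk store."""
    with open(path, "wb") as f:
        for fp in recipe:
            f.write(chunk_store[fp])
```

A file is then represented by its recipe, the ordered list of fingerprints, which is how identical regions in different files end up pointing at the same stored chunk.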

Cited by 88 publications (43 citation statements)
References 26 publications
“…The zero chunk contributes significantly to the deduplication potential in enterprise backups and virtual machine images [55], [56]. In their HPC study, Meister et al found that between 3.1% and 24.3% of their HPC data consist of zero chunks [12]. In our case, the zero chunk is the most used chunk and is the main source of redundant data for every application and chunk size, except CDC with an average chunk size of 32 KB.…”
Section: A General Deduplication (mentioning, confidence: 57%)
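A zero chunk is simply a chunk whose bytes are all zero, typically produced by sparse or preallocated files. A rough sketch of how such a share could be measured, under assumed parameters (fixed 8 KiB chunks, a plain directory walk), is shown below; it is not the procedure Meister et al. used, just an illustration of the metric.

```python
# Estimate the fraction of fixed-size chunks that consist entirely of
# zero bytes under a directory tree (illustrative parameters).
import os

CHUNK_SIZE = 8 * 1024          # assumed chunk size for the scan

def zero_chunk_ratio(root):
    total = zero = 0
    zero_block = bytes(CHUNK_SIZE)
    for dirpath, _, files in os.walk(root):
        for name in files:
            try:
                with open(os.path.join(dirpath, name), "rb") as f:
                    while chunk := f.read(CHUNK_SIZE):
                        total += 1
                        # A short trailing chunk is compared against a
                        # zero run of its own length.
                        if chunk == zero_block[:len(chunk)]:
                            zero += 1
            except OSError:
                continue       # skip unreadable files
    return zero / total if total else 0.0
```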
“…However, we vary the number of used processes in Section V-C. Table I shows the different sizes of the checkpoints. c) Deduplication: We analyzed each checkpoint with the FS-C deduplication tool suite [49], which has already been applied in several deduplication studies [50], [51]. We chose fixed-sized chunking and content-defined chunking (CDC) as chunking methods.…”
Section: Deduplication Of Checkpoints (mentioning, confidence: 99%)
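The two chunking methods named in the quote differ only in where they place chunk boundaries. The sketch below contrasts a fixed-size chunker with a much simplified content-defined chunker that cuts wherever a cheap hash of the bytes since the last cut matches a bit mask; the hash, mask, and size limits are stand-in assumptions, not the parameters FS-C uses.

```python
def fixed_chunks(data, size=8 * 1024):
    """Fixed-size chunking: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data, mask=(1 << 13) - 1, min_size=2 * 1024, max_size=64 * 1024):
    """Toy content-defined chunking: cut where a simple hash of the bytes
    since the last cut matches `mask`, bounded by min/max chunk sizes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * 31 + byte) & 0xFFFFFFFF   # cheap stand-in for a Rabin fingerprint
        length = i - start + 1
        if length < min_size:
            continue
        if (h & mask) == mask or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because the toy hash is not windowed, boundaries resynchronize less robustly after an insertion than with the sliding-window fingerprints production chunkers typically use; the point of the sketch is only that CDC boundaries follow content while fixed-size boundaries follow absolute offsets.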
“…Instead of storing duplicate data, a reference to the original block is created for each repeated occurrence. Our previously conducted study for HPC data already showed great potential for data savings, allowing 20-30 % of redundant data to be eliminated on average [21]. To determine the potential savings, we independently scanned 12 sets of directories with a total amount of data of more than 1 PB.…”
Section: Deduplication (mentioning, confidence: 99%)
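The savings figure quoted above can be read as one minus the ratio of unique chunk bytes to total chunk bytes. A sketch of that bookkeeping, assuming SHA-256 fingerprints and any chunking method (such as the chunkers sketched earlier), follows.

```python
# Estimate the fraction of data a deduplication scan would eliminate
# (illustrative; fingerprint and chunking choices are assumptions).
import hashlib

def dedup_savings(chunks):
    """1 - (unique chunk bytes / total chunk bytes) over an iterable of chunks."""
    total = unique = 0
    seen = set()
    for chunk in chunks:
        total += len(chunk)
        fp = hashlib.sha256(chunk).digest()
        if fp not in seen:
            seen.add(fp)
            unique += len(chunk)
    return 1 - unique / total if total else 0.0
```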
“…attempt to reduce the checkpoint sizes. While there are several techniques proposed in this direction, recent studies [6] point out that deduplication (i.e. identifying identical copies of data and storing only one copy) shows promising potential, with reported reductions of up to 70%.…”
Section: Introduction (mentioning, confidence: 99%)