2009
DOI: 10.1145/1496909.1496926

Efficient detection of large-scale redundancy in enterprise file systems

Abstract: In order to catch and reduce waste in the exponentially increasing demand for disk storage, we have developed very efficient technology to detect approximate duplication of large directory hierarchies. Such duplication can be caused, for example, by unnecessary mirroring of repositories by uncoordinated employees or departments. Identifying these duplicate or near-duplicate hierarchies allows appropriate action to be taken at a high level. For example, one could coordinate and consolidate multiple copies in one…

Cited by 25 publications (17 citation statements)
References 4 publications
“…There are many identical directories created by individual users in enterprise file systems [11], such as software packages, copies of repositories, directories of photos or music. This suggests that if two directories share one file, other files may also be shared, a common form of file locality.…”

Section: B. A Case for Exploiting File Semantics
confidence: 99%
“…• A deterministic solution that reports the exact metrics before the actual migration (i.e., we find the exact space reclamation they produce and the associated penalties, such as network costs and physical space consumption). • We find optimal datasets for space reclamation that are significantly and consistently better than alternatives, namely the strategy of migrating unique files and the strategy based on MinHash [5].…”

Section: Introduction
confidence: 94%
“…Also, the notion of similarity is drawn from disk sharing relationships across files and not the semantic content of the documents. We show in our evaluation that grouping files using hashing techniques such as MinimumHash [4] and approximate MinHash [5] does not work well for space reclamation.…”

Section: Related Work
confidence: 99%
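The MinHash technique referenced above estimates the Jaccard similarity of two sets (here, the sets of files two directories hold) from small fixed-size signatures. A minimal sketch, assuming per-slot seeded hashing via SHA-1 (the specific hash function and signature length are illustrative choices, not drawn from the cited papers):

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over the set's items; identical sets yield identical
    signatures, and the slot-match rate estimates Jaccard similarity."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two directory file sets sharing 3 of 5 distinct names.
a = minhash_signature({"a.txt", "b.txt", "c.txt", "d.txt"})
b = minhash_signature({"a.txt", "b.txt", "c.txt", "e.txt"})
print(estimated_jaccard(a, b))  # estimate of the true Jaccard, 3/5 here
```

Because signatures are fixed-size regardless of directory size, they can be compared pairwise (or bucketed) far more cheaply than the underlying file listings.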
“…applications in the context of content matching for online advertising [28], detection of redundancy in enterprise file systems [13], syntactic similarity algorithms for enterprise information management [8], Web spam [35], etc. The recent development of b-bit minwise hashing [23,24] provided a substantial improvement in the estimation accuracy and speed by proposing a new estimator that stores only the lowest b bits of each hashed value.…”

Section: Introduction
confidence: 99%
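The b-bit idea mentioned in this excerpt shrinks each MinHash slot to its lowest b bits and then corrects the raw match rate for chance collisions. A self-contained sketch, using a simplified correction term of 2^-b (a sparsity assumption; the estimator in the b-bit minwise hashing papers is more refined, and the helper MinHash here is illustrative):

```python
import hashlib

def minhash_signature(items, num_hashes=128):
    # Plain MinHash signature: one seeded hash per slot, keep the minimum.
    return [
        min(int.from_bytes(hashlib.sha1(f"{s}:{x}".encode()).digest()[:8], "big")
            for x in items)
        for s in range(num_hashes)
    ]

def bbit_similarity(sig_a, sig_b, b=2):
    """Compare only the lowest b bits of each slot, then correct the raw
    match rate for accidental low-bit collisions (prob ~ 2^-b under the
    simplifying sparsity assumption stated in the lead-in)."""
    mask = (1 << b) - 1
    p = sum((x & mask) == (y & mask) for x, y in zip(sig_a, sig_b)) / len(sig_a)
    c = 2.0 ** -b
    return max(0.0, (p - c) / (1 - c))
```

Storing 2 bits instead of a full 64-bit value per slot cuts signature storage by roughly 32x, at the cost of needing more slots for the same estimation accuracy.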