2014 30th Symposium on Mass Storage Systems and Technologies (MSST) 2014
DOI: 10.1109/msst.2014.6855542

The case for sampling on very large file systems

Abstract: Sampling has long been a prominent tool in statistics and analytics, first and foremost when very large amounts of data are involved. In the realm of very large file systems (and hierarchical data stores in general), however, sampling has mostly been ignored and for several good reasons. Mainly, running sampling in such an environment introduces technical challenges that make the entire sampling process non-beneficial. In this work we demonstrate that there are cases for which sampling is very worthwhile in ve…

Cited by 1 publication (1 citation statement)
References 22 publications
“…This algorithm, too, is efficient and can be applied to data streams. Random sampling is still an active research field and new sampling schemes are studied in various contexts; some indicative examples are sampling from sliding windows [13], from distributed data streams [4,15,5], from streams with time decay [6], independent range sampling [10], sampling on very large file systems [9], and stratified reservoir sampling [2]. In light of the above results (which are mainly from the data streams field), we consider the algorithms of [3] and [8] as fundamental sampling schemes for general purpose weighted random sampling over data streams.…”
Section: Introduction
Citation type: mentioning
confidence: 99%