2005
DOI: 10.1155/2005/962135

Interpreting the Data: Parallel Analysis with Sawzall

Abstract: Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics…

Cited by 376 publications (235 citation statements)
References 14 publications
“…In [2] we compared Storacle with state-of-the-art off-the-shelf NoSQL and SQL data bases by using relevant benchmarks and taking into account the limitation of storage size and processing resources that may be present at machines in a substation. A format evaluation in [2] suggested the use of the Protocol Buffer format [15] as basis as it leads to required data size and retrieval time superior to other potential data bases in this use case. In [2] we further listed Cube, RRD4J, Cassandra, InfluxDB, neo4j and OpenTSDB and described why it is not recommended to use these already existing time series database systems for this use case.…”
Section: Storacle (mentioning)
confidence: 99%
“…Many of the individual systems that comprise this infrastructure have been the subject of academic publications [3,4,5,6,7,8,9,10] and received considerable interest, since they demonstrate practical approaches that have been deployed in live production environments on very large scales.…”
Section: Data-intensive Computing (mentioning)
confidence: 99%
“…For example, Sawzall [10] is an interpreted language for data analysis that is specifically designed to be integrated with MapReduce as an underlying execution engine. A Sawzall program conceptually executes in parallel for every record in a data set, and may produce output by emitting records to any number of declared aggregators.…”
Section: Workflow Composition and High-level Languages (mentioning)
confidence: 99%
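
The execution model described in the statement above (one invocation per record, output only through declared aggregators) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Sawzall syntax and not the paper's implementation: the names `process_record`, `run_shard`, `merge`, and `analyze` are hypothetical, only sum-style aggregators are modeled, and the shards are processed sequentially where a real deployment would hand them to MapReduce workers.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# An "emit" callback sends one value to a named aggregator.
Emit = Callable[[str, float], None]


def process_record(x: float, emit: Emit) -> None:
    """Per-record logic: sees exactly one record and only emits values."""
    emit("count", 1)
    emit("total", x)
    emit("sum_of_squares", x * x)


def run_shard(records: Iterable[float],
              per_record: Callable[[float, Emit], None]) -> Dict[str, float]:
    """Run the per-record function over one shard, keeping local sums."""
    table: Dict[str, float] = defaultdict(float)

    def emit(name: str, value: float) -> None:
        table[name] += value  # sum aggregator: addition is order-independent

    for record in records:
        per_record(record, emit)
    return dict(table)


def merge(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-shard partial sums into the final aggregator values."""
    merged: Dict[str, float] = defaultdict(float)
    for partial in partials:
        for name, value in partial.items():
            merged[name] += value
    return dict(merged)


def analyze(shards: List[List[float]]) -> Dict[str, float]:
    """Sequential stand-in for the parallel run: one partial table per shard."""
    return merge(run_shard(shard, process_record) for shard in shards)
```

Because each call to `process_record` touches only its own record and the aggregators combine values by addition, the shards could be processed in any order or on different machines without changing the result, which is the property that makes this style of analysis easy to distribute.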
“…Several distributed job execution engines have been proposed [5,4,15,25], and several high-level job description languages have been defined [7,16,26,27,28]. However, complex scientific analysis tasks are only just beginning to be ported to these new platforms.…”
Section: Background and Related Work (mentioning)
confidence: 99%