Continuous analytics over discontinuous streams

Krishnamurthy, Sailesh; Franklin, Michael J.; Davis, Jeffrey R.; Farina, Daniel; Golovko, Pasha; Li, Alan; Thombre, Neil

doi:10.1145/1807167.1807290

Cited by 84 publications

(47 citation statements)

References 7 publications


“…On the flip side, precise order semantics is sometimes costly to guarantee while not even necessary for realistic workloads [13]. Punctuations [14,15] were thus suggested as a mechanism to allow for local disorder within the stream.…”

Section: Order Of Results Tuples (mentioning)

confidence: 99%

Low-latency handshake join

2014


This work revisits the processing of stream joins on modern hardware architectures. Our work is based on the recently proposed handshake join algorithm, which is a mechanism to parallelize the processing of stream joins in a NUMA-aware and hardware-friendly manner. Handshake join achieves high throughput and scalability, but it suffers from a high latency penalty and a non-deterministic ordering of the tuples in the physical result stream. In this paper, we first characterize the latency behavior of the handshake join and then propose a new low-latency handshake join algorithm, which substantially reduces latency without sacrificing throughput or scalability. We also present a technique to generate punctuated result streams with very little overhead; such punctuations allow the generation of correctly ordered physical output streams with negligible effect on overall throughput and latency.
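The way punctuations can turn a locally disordered stream into a correctly ordered one can be sketched as follows. This is a minimal illustration, not the algorithm from the paper above: the class name `PunctuatedReorderBuffer` and its methods are hypothetical, and the sketch assumes a punctuation with timestamp `ts` promises that no tuple with a timestamp at or below `ts` will arrive afterwards.

```python
import heapq


class PunctuatedReorderBuffer:
    """Buffers out-of-order tuples and releases them in timestamp order.

    Hypothetical helper for illustration only. Assumes a punctuation with
    timestamp ts guarantees that no later tuple has a timestamp <= ts.
    """

    def __init__(self):
        self._heap = []  # min-heap ordered by (timestamp, arrival sequence)
        self._seq = 0    # tie-breaker so payloads are never compared

    def on_tuple(self, ts, payload):
        # Tuples may arrive in any order; just buffer them.
        heapq.heappush(self._heap, (ts, self._seq, payload))
        self._seq += 1

    def on_punctuation(self, ts):
        # Everything at or below the punctuation is now safe to emit in order.
        released = []
        while self._heap and self._heap[0][0] <= ts:
            t, _, payload = heapq.heappop(self._heap)
            released.append((t, payload))
        return released


buf = PunctuatedReorderBuffer()
for ts, payload in [(3, "c"), (1, "a"), (2, "b"), (5, "e")]:
    buf.on_tuple(ts, payload)

first_batch = buf.on_punctuation(3)    # tuples with timestamps 1, 2, 3, sorted
second_batch = buf.on_punctuation(10)  # the remaining tuple with timestamp 5
```

Tuples between two punctuations can arrive in any order, so the operator producing them does not have to pay for strict ordering; only the consumer that needs ordered output sorts, and only within each punctuated segment.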


“…This computation can be performed using an efficient incremental reduce operator that adds the old counts computed at t + 1 to the counts of new records since then, avoiding wasted work. This approach is similar to "order-independent processing" [19].…”

Section: Timing Considerations (mentioning)

confidence: 99%

Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing

Zaharia, Das, Li et al. 2012



Many "big data" applications need to act on data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup schemes in streaming databases (parallel recovery of lost state) and, unlike previous systems, also mitigate stragglers. We implement D-Streams as an extension to the Spark cluster computing engine that lets users seamlessly intermix streaming, batch and interactive queries. Our system can process over 60 million records/second at sub-second latency on 100 nodes.
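The incremental reduce mentioned in the citation statement for this paper (adding previously computed counts to the counts of records that arrived since) can be illustrated with a small sketch. It is plain Python rather than the D-Streams or Spark API; the function name `incremental_count` and the sample records are assumptions made for the example.

```python
from collections import Counter


def incremental_count(old_counts, new_records):
    """Return updated per-key counts without rescanning earlier records.

    Hypothetical helper: old_counts is the result computed at an earlier
    time, new_records are the records that arrived since then.
    """
    updated = Counter(old_counts)  # copy, so the earlier result stays intact
    updated.update(new_records)    # add one to the count of each new record's key
    return updated


early_records = ["a", "b", "a"]  # records seen when the first result was computed
late_records = ["b", "c"]        # records that arrived afterwards

early_counts = incremental_count(Counter(), early_records)
final_counts = incremental_count(early_counts, late_records)
```

Because counting is associative and commutative, the final result does not depend on the order in which records arrive, which is what makes the comparison to order-independent processing apt.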


“…This computation can be performed with an efficient incremental reduce operation that adds the old counts computed at t + 1 to the counts of new records since then, avoiding wasted work. This approach is similar to order-independent processing [67].…”

Section: Timing Considerations (mentioning)

confidence: 99%

“…They have been studied in detail in databases [67,99]. In general, any such technique can be implemented over D-Streams by "discretizing" its computation in small batches (running the same logic in batches).…”

Section: Timing Considerations (mentioning)

confidence: 99%

An Architecture for Fast and General Data Processing on Large Clusters

Zaharia

2016


The past few years have seen a major change in computing systems, as growing data volumes and stalling processor speeds require more and more applications to scale out to distributed systems. Today, a myriad data sources, from the Internet to business operations to scientific instruments, produce large and valuable data streams. However, the processing capabilities of single machines have not kept up with the size of data, making it harder and harder to put to use. As a result, a growing number of organizations (not just web companies, but traditional enterprises and research labs) need to scale out their most important computations to clusters of hundreds of machines.

At the same time, the speed and sophistication required of data processing have grown. In addition to simple queries, complex algorithms like machine learning and graph analysis are becoming common in many domains. And in addition to batch processing, streaming analysis of new real-time data sources is required to let organizations take timely action. Future computing platforms will need to not only scale out traditional workloads, but support these new applications as well.

This dissertation proposes an architecture for cluster computing systems that can tackle emerging data processing workloads while coping with larger and larger scales. Whereas early cluster computing systems, like MapReduce, handled batch processing, our architecture also enables streaming and interactive queries, while keeping the scalability and fault tolerance of previous systems. And whereas most deployed systems only support simple one-pass computations (e.g., aggregation or SQL queries), ours also extends to the multi-pass algorithms required for more complex analytics (e.g., iterative algorithms for machine learning). Finally, unlike the specialized systems proposed for some of these workloads, our architecture allows these computations to be combined, enabling rich new applications that intermix, for example, streaming and batch processing, or SQL and complex analytics. We achieve these results through a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs). We show that this is enough to efficiently capture a wide range of workloads. We implement RDDs in the open source Spark system, which we evaluate using both synthetic benchmarks and real user applications. Spark matches or exceeds the performance of specialized systems in many application domains, while offering stronger fault tolerance guarantees and allowing these workloads to be combined. We explore the generality of RDDs from both a theoretical modeling perspective and a practical perspective to see why this extension can capture a wide range of previously disparate workloads.
