Many of the data sources used in stream query processing are known to exhibit bursty behavior. Data in a burst often has different characteristics than steady-state data, and therefore may be of particular interest. In this paper, we describe the Data Triage architecture that we are adding to TelegraphCQ to provide low-latency results with good accuracy under such bursts.

Background

One of the distinguishing properties of stream query processors is that they produce query results in real time. For applications like financial market analysis, network monitoring, and inventory tracking, timely query results are of great importance.

Studies show that common sources of streaming data (network traffic, environmental monitoring, software logs, etc.) often exhibit "bursty" behavior [1][2]. Bursty behavior is characterized by periods of low data rates punctuated by "bursts" of high data rates that vary in their length and speed. Available network bandwidth and incoming query workloads may also be affected during bursts, leading to a situation in which the effective load on a stream query processor can vary rapidly and unpredictably by orders of magnitude.

Note that bursts often produce not only more data, but also different data than usual. This will often be the case, for example, in crisis scenarios (network attacks, environmental incidents, software malfunctions, etc.), where a high volume of unusual readings may be reported to the system. Hence, analysts may be particularly eager to capture the properties of the data in the burst.

The requirement for low result latency under heavy load raises design challenges, since query processors must return useful results quickly regardless of the rate at which they receive data. Much recent work has focused on methods of coping with excessive data rates in streaming query processors by shedding load.
Figure 1. The Data Triage load-shedding architecture. We embed triage queues inside the gateway modules that convert data streams into the system's internal format. If the query engine cannot consume tuples at the rate they enter the triage queues, the system builds synopses of the excess tuples.

We refer the reader to our technical report [3] for a more in-depth description of previous work.

We believe that bursty data arrival poses unique challenges for load shedding that have not been adequately addressed in previous work. Since bursts can occur suddenly, load-shedding mechanisms need to react quickly to changes in data rates. Due to the very high variation in bandwidth exhibited by bursty data sources, load-shedding mechanisms need to produce accurate query results with low latency across a wide range of system loads. Finally, because bursts may contain the most interesting information, load shedding should not simply discard excess data; it must capture properties of the missing data.

Archit...
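The triage-queue idea can be sketched as a bounded queue that summarizes, rather than silently drops, its overflow. The following is a minimal illustration only: the TriageQueue and Synopsis class names are ours, and the reservoir-sample synopsis stands in for the richer summaries (histograms, sketches) a real Data Triage deployment would build; this is not the actual TelegraphCQ implementation.

```python
import random
from collections import deque


class Synopsis:
    """Toy reservoir-sample summary of triaged (overflow) tuples.

    Illustrative stand-in for a real synopsis data structure.
    """

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.sample = []
        self.count = 0  # total tuples ever summarized

    def add(self, tup):
        self.count += 1
        if len(self.sample) < self.capacity:
            self.sample.append(tup)
        else:
            # Classic reservoir sampling: keep each tuple with prob k/n.
            j = random.randrange(self.count)
            if j < self.capacity:
                self.sample[j] = tup


class TriageQueue:
    """Bounded queue; tuples that do not fit are synopsized, not lost."""

    def __init__(self, maxlen=8):
        self.maxlen = maxlen
        self.queue = deque()
        self.synopsis = Synopsis()

    def push(self, tup):
        if len(self.queue) < self.maxlen:
            self.queue.append(tup)       # engine will process this exactly
        else:
            self.synopsis.add(tup)       # triage: summarize the overflow

    def pop(self):
        return self.queue.popleft() if self.queue else None


# During a burst, 20 tuples arrive faster than the engine consumes them:
tq = TriageQueue(maxlen=8)
for i in range(20):
    tq.push(("reading", i))

print(len(tq.queue))       # 8 tuples queued for exact processing
print(tq.synopsis.count)   # 12 tuples captured only in the synopsis
```

The point of the sketch is the push-time branch: under steady-state load every tuple takes the exact path, and only when the burst fills the queue does the cheap summarization path engage, preserving approximate properties of the excess data.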