On random sampling over joins

Chaudhuri, Surajit; Motwani, Rajeev; Narasayya, Vivek

doi:10.1145/304182.304206

Cited by 193 publications

(176 citation statements)

References 11 publications

Supporting

Mentioning

173

Contrasting

Unclassified

“…The other method uses the packets arrival times (which can be used to anticipate a traffic burst) together with knowledge about the processing time required to process a sample. [23] proposes to use the least squares estimate and a certain set of heuristic rules to determine the sampling rate. The authors in [24] describe a flow sampling approach, which allows controlling the expected volume of samples and minimizes the variance of the estimates.…”

Section: Adaptive Sampling (mentioning)

confidence: 99%

Intelligent Network Client Profiler

Arsénio

2013

JACN

Abstract: Peer2Peer traffic already accounts for a large share of the overall internet traffic. Future solutions will need to manage all the available resources in order to charge users using fair rules according to their communication profile. Obtaining information about the behavior of Internet traffic is therefore fundamental to the management, monitoring and operation activities, such as the identification of applications and protocols that customers use. However, the main obstacle to this identification is the lack of scalability for monitoring network devices. In particular, they can analyze all the network packets for this purpose. This task is extremely demanding and almost impossible to accomplish rapidly in large networks (because usually there is a number in the hundreds or thousands of customers). Furthermore, we expect such networks to become even larger, as on the internet of things all devices (sensors, appliances, etc) will be publicly connected to the internet. As such, traffic sampling strategies have been proposed to overcome this major problem of scale. This paper presents different works in the area of monitoring traffic for user profiling and security purposes. It proposes as well a solution that uses selective filtering techniques combined with an engine traffic DPI to identify applications and protocols that customers use most frequently. Thus it becomes possible to get ISPs to optimize their network in a scalable and intelligent manner, imposing security measures in order to enforce network usage according to client profiles.

“…For join queries that access attributes from multiple datasets R_1, ..., R_l it is conceivable to construct a result approximation or result size estimation from multiple synopses. On the other hand, it is known that this approach may lead to unbounded approximation errors [5]. Therefore, we have adopted the approach of [1] to use special join synopses for this purpose.…”

Section: Framework (mentioning)

confidence: 99%

“…As pointed out in [5] (in the context of sampling), it is usually not feasible to estimate arbitrary join queries from approximations of the joining base relations with acceptable accuracy. For sampling, this phenomenon is discussed extensively in [5], but it does also hold for all other data reduction techniques that estimate join queries from approximations of the base relations.…”

Section: Join Synopses (mentioning)

confidence: 99%

“…This broad importance of statistics management has led to a plethora of approximation techniques, for which [11] have coined the general term "data synopses": advanced forms of histograms [24,12,16], spline synopses [18,19], sampling [5,13,10], and parametric curve-fitting techniques [27,7] all the way to highly sophisticated methods based on kernel estimators [2] or Wavelets and other transforms [22,21,3]. However, most of these techniques take the local viewpoint of optimizing the approximation error for a single data distribution such as one database table with preselected relevant attributes.…”

Section: Introduction (mentioning)

confidence: 99%

A Framework for the Physical Design Problem for Data Synopses

König; Weikum

2002

Advances in Database Technology — EDBT 2002

Abstract. Maintaining statistics on multidimensional data distributions is crucial for predicting the run-time and result size of queries and data analysis tasks with acceptable accuracy. To this end a plethora of techniques have been proposed for maintaining a compact data "synopsis" on a single table, ranging from variants of histograms to methods based on wavelets and other transforms. However, the fundamental question of how to reconcile the synopses for large information sources with many tables has been largely unexplored. This paper develops a general framework for reconciling the synopses on many tables, which may come from different information sources. It shows how to compute the optimal combination of synopses for a given workload and a limited amount of available memory. The practicality of the approach and the accuracy of the proposed heuristics are demonstrated by experiments.
