Hierarchically organized skew-tolerant histograms for geographic data objects

Roh, Young Jun; Kim, Jae Ho; Chung, Yon Dohn; Son, Jun Ho; Kim, Myoung Ho

doi:10.1145/1807167.1807236

Cited by 16 publications

(13 citation statements)

References 33 publications


“…We vary the number of histogram buckets from 50 to 250 like most other researchers do [3], [24], [27], [29].…”

Section: Methods (mentioning)

confidence: 99%

“…A cardinality estimate which is close to the real cardinality enables the optimizer to accurately estimate the costs of different plans, and to choose a good plan. Therefore, the quality of a histogram is conventionally measured by the error the histogram produces over a series of queries [3], [12], [24], [27], [29]. Given a workload W and histogram H, the Mean Absolute Error is:…”

Section: Methods (mentioning)

confidence: 99%
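The quotation above is cut off before its formula, and the citing paper's exact definition is not reproduced here. As a hedged illustration only, the sketch below uses the common reading of mean absolute error: the average of |estimated cardinality − actual cardinality| over the queries in a workload. The names `mean_absolute_error`, `estimate`, and `actual`, the 1-D range-query type, and the toy data are all assumptions made for this example.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical query type: a 1-D half-open range [low, high).
Query = Tuple[float, float]


def mean_absolute_error(
    workload: Sequence[Query],
    estimate: Callable[[Query], float],
    actual: Callable[[Query], float],
) -> float:
    """Average |estimate(q) - actual(q)| over all queries in the workload."""
    if not workload:
        raise ValueError("workload must contain at least one query")
    total = sum(abs(estimate(q) - actual(q)) for q in workload)
    return total / len(workload)


# Toy data set and a one-bucket "histogram" that assumes uniform density
# over the domain [0, 10). Both are illustrative assumptions.
DATA = [1.0, 2.0, 2.0, 3.0, 8.0, 9.0]
DOMAIN_LOW, DOMAIN_HIGH = 0.0, 10.0


def actual_cardinality(q: Query) -> float:
    low, high = q
    return float(sum(1 for x in DATA if low <= x < high))


def uniform_estimate(q: Query) -> float:
    low, high = q
    fraction = (high - low) / (DOMAIN_HIGH - DOMAIN_LOW)
    return len(DATA) * fraction


WORKLOAD = [(0.0, 5.0), (5.0, 10.0)]
mae = mean_absolute_error(WORKLOAD, uniform_estimate, actual_cardinality)
```

A lower value means the histogram's estimates track the true result sizes more closely on that particular workload; a different workload can rank the same histograms differently.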

“…They are adaptive to query patterns of the user, and stay up-to-date to the data, i.e., unlike static histograms, one does not need to re-build them regularly. As one representative, we consider the data structure of STHoles [3], which is very flexible and has been used in several other histograms [13], [24], [27]. STHoles tries to find rectangular regions in the dataspace which have close to uniform density.…”

Section: Introduction (mentioning)

confidence: 99%


Improving Accuracy and Robustness of Self-Tuning Histograms by Subspace Clustering

Khachatryan, Müller, Stier, et al. 2015

IEEE Trans. Knowl. Data Eng.


In large databases, the amount and the complexity of the data call for data summarization techniques. Such summaries are used to assist fast approximate query answering or query optimization. Histograms are a prominent class of model-free data summaries and are widely used in database systems. So-called self-tuning histograms look at query-execution results to refine themselves. An assumption with such histograms, which has not been questioned so far, is that they can learn the dataset from scratch, that is, starting with an empty bucket configuration. We show that this is not the case. Self-tuning methods are very sensitive to the initial configuration. Three major problems stem from this. Traditional self-tuning is unable to learn projections of multi-dimensional data, is sensitive to the order of queries, and reaches only local optima with high estimation errors. We show how to improve a self-tuning method significantly by starting with a carefully chosen initial configuration. We propose initialization by dense subspace clusters in projections of the data, which improves both accuracy and robustness of self-tuning. Our experiments on different datasets show that the error rate is typically halved compared to the uninitialized version.


“…Our compression technique can be viewed as a bottom-up approach for building a histogram, as it proceeds by progressively aggregating pairs of tuples (starting from the original ones), and the final aggregate tuples can be viewed as buckets storing aggregate information on the original tuples merged into them. However, it is worth noting that the above-mentioned histogram-construction techniques (as well as more recent proposals [9,16,15,17,23,25,32]) cannot be easily extended to deal with our setting. In fact, these techniques are guided only by the measure values associated with the points that will be aggregated into buckets, and do not take into account any precedence (temporal) relationship between points. This means that they are not able to construct a histogram from which the structure of the processes can be re-composed with no loss.…”

Section: Related Work (mentioning)

confidence: 99%

A compression-based framework for the efficient analysis of business process logs

Fazzinga, Flesca, Furfaro, et al. 2015

Proceedings of the 27th International Conference on Scientific and Statistical Database Management


The increasing availability of large process log repositories calls for efficient solutions for their analysis. In this regard, a novel specialized compression technique for process logs is proposed, that builds a synopsis supporting a fast estimation of aggregate queries, which are of crucial importance in exploratory and high-level analysis tasks. The synopsis is constructed by progressively merging the original log-tuples, which represent single activity executions within the process instances, into aggregate tuples, summarizing sets of activity executions. The compression strategy is guided by a heuristic aiming at limiting the loss of information caused by summarization, while guaranteeing that no information is lost on the set of activities performed within the process instances and on the order among their executions. The selection conditions in an aggregate query are specified in terms of a graph pattern, that allows precedence relationships over activity executions to be expressed, along with conditions on their starting times, durations, and executors. The efficacy of the compression technique, in terms of capability of reducing the size of the log and of accuracy of the estimates retrieved from the synopsis, has been experimentally validated.


“…Once again, this method uses a rectangular grid as a starting point thus making it dependent on the initial grid resolution. STHist [31] applies the idea of GenHist to 2D and 3D spatial objects. In the basic algorithm, dense regions are determined by applying a sliding window over each dimension, approximating the frequency distribution with a marginal distribution.…”

Section: Related Work (mentioning)

confidence: 99%
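The quotation above describes the sliding-window step only in outline. As a rough sketch of that general idea, and not STHist's actual algorithm, the code below projects points onto one dimension to get a marginal count per grid cell, then slides a fixed-width window over those counts and reports the densest position. The function names, the grid resolution, the window width, and the sample points are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def marginal_counts(
    points: Sequence[Sequence[float]],
    dim: int,
    n_cells: int,
    lo: float,
    hi: float,
) -> List[int]:
    """Count points per cell along one dimension (the marginal distribution).

    Assumes every coordinate lies in [lo, hi]; hi itself falls in the last cell.
    """
    counts = [0] * n_cells
    width = (hi - lo) / n_cells
    for p in points:
        idx = min(int((p[dim] - lo) / width), n_cells - 1)
        counts[idx] += 1
    return counts


def densest_window(counts: Sequence[int], window: int) -> Tuple[int, int]:
    """Return (start_cell, total) of the densest run of `window` adjacent cells."""
    if not 1 <= window <= len(counts):
        raise ValueError("window must be between 1 and len(counts)")
    current = sum(counts[:window])
    best_start, best_total = 0, current
    for start in range(1, len(counts) - window + 1):
        # Slide by one cell: add the entering cell, drop the leaving one.
        current += counts[start + window - 1] - counts[start - 1]
        if current > best_total:
            best_start, best_total = start, current
    return best_start, best_total


# Illustrative 2-D points; dimension 0 is projected onto 10 unit-width cells.
POINTS = [(0.5, 0.5), (4.2, 1.0), (4.8, 2.0), (5.1, 9.0), (9.5, 5.0)]
x_counts = marginal_counts(POINTS, dim=0, n_cells=10, lo=0.0, hi=10.0)
x_start, x_total = densest_window(x_counts, window=2)
```

Repeating the same scan for each dimension yields one candidate interval per axis. Working on marginals keeps the scan cheap, at the cost of ignoring correlations between dimensions.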

Statistics collection in oracle spatial and graph

Bamba, Ravada, et al. 2013

Proc. VLDB Endow.


Oracle Spatial and Graph is a geographic information system (GIS) which provides users the ability to store spatial data alongside conventional data in Oracle. As a result of the coexistence of spatial and other data, we observe a trend towards users performing increasingly complex queries which involve spatial as well as non-spatial predicates. Accurate selectivity values, especially for queries with multiple predicates requiring joins among numerous tables, are essential for the database optimizer to determine a good execution plan. For queries involving spatial predicates, this requires that reasonably accurate statistics collection has been performed on the spatial data. For extensible data cartridges such as Oracle Spatial and Graph, the optimizer expects to receive accurate predicate selectivity and cost values from functions implemented within the data cartridge. Although statistics collection for spatial data has been researched in academia for a few years, to the best of our knowledge, this is the first work to present spatial statistics collection implementation details for a commercial GIS database. In this paper, we describe our experiences with implementation of statistics collection methods for complex geometry objects within Oracle Spatial and Graph. Firstly, we exemplify issues with previous partitioning-based algorithms in the presence of complex geometry objects and suggest enhancements which resolve the issues. Secondly, we propose a main memory implementation which not only speeds up the disk-based partitioning algorithms but also utilizes existing R-tree indexes to provide surprisingly accurate selectivity estimates. Last but not least, we provide extensive experimental results and an example study which displays the efficacy of our approach on Oracle query performance.
