An Application of Multivariate Statistical Analysis for Query-Driven Visualization

Gosink, Luke; Garth, Christoph; Anderson, John C.; Bethel, E. Wes; Joy, Ken

doi:10.1109/tvcg.2010.80

Cited by 30 publications

(17 citation statements)

References 42 publications

“…Such complex, aggregate queries typically involve large datasets (which may themselves be the result of linking of other different datasets) and a number of range predicates over multidimensional vectors, structured, semi- and unstructured data. Query-driven data exploration and predictive learning is becoming increasingly important in the presence of large-scale data [7] since predicting aggregations over range predicate queries is a fundamental data exploration task [8] in big data systems. Frequently, data analysts and statisticians are in search of (approximate and/or partial) answers to such queries over unknown data subspaces (knowledge discovery).…”

Section: Introduction (mentioning)

confidence: 99%

Learning to accurately COUNT with query-driven predictive analytics

Anagnostopoulos

Triantafillou

2015

2015 IEEE International Conference on Big Data (Big Data)

Abstract: We study a novel solution to executing aggregation (and specifically COUNT) queries over large-scale data. The proposed solution is generally applicable, in the sense that it can be deployed in environments in which data owners may or may not restrict access to their data and allow only 'aggregation operators' to be executed over their data. For this, it is based on predictive analytics, driven by queries and their results. We propose a machine learning (ML) framework for the task (which can be adapted for different aggregates as well). We focus on the widely used set-cardinality (i.e., COUNT) aggregation operator, as it is a fundamental operator for both internal data system optimisations and for aggregation-query analytics. We contribute a novel, query-driven ML model whose goals are to: (i) learn the query space (access patterns), (ii) associate (complex) aggregation queries with the cardinality of their results, (iii) define query similarity and use it to predict the cardinality of the answer set of an ad-hoc incoming query. Our ML model incorporates incremental learning algorithms for ensuring high prediction accuracy even when both the querying patterns and the underlying data change. The significance of contribution lies in that it (i) is the only query-driven solution applicable over general environments which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for big data analytics, and (iii) offers a performance (in terms of prediction accuracy and time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, evaluating its sensitivity and comparative advantages versus acclaimed data-centric methods (self-tuning histograms, sampling, and multidimensional histograms).

“…Techniques such as data subsetting [15][16][17][18][19] and feature identification and tracking [20][21][22][23] have also been well studied.…”

Section: General Compression In Visualization (mentioning)

confidence: 99%

Subsampling-based compression and flow visualization

et al. 2015

As computational capabilities increasingly outpace disk speeds on leading supercomputers, scientists will, in turn, be increasingly unable to save their simulation data at its native resolution. One solution to this problem is to compress these data sets as they are generated and visualize the compressed results afterwards. We explore this approach, specifically subsampling velocity data and the resulting errors for particle advection-based flow visualization. We compare three techniques: random selection of subsamples, selection at regular locations corresponding to multi-resolution reduction, and introduce a novel technique for informed selection of subsamples. Furthermore, we explore an adaptive system which exchanges the subsampling budget over parallel tasks, to ensure that subsampling occurs at the highest rate in the areas that need it most. We perform supercomputing runs to measure the effectiveness of the selection and adaptation techniques. Overall, we find that adaptation is very effective, and, among selection techniques, our informed selection provides the most accurate results, followed by the multi-resolution selection, and with the worst accuracy coming from random subsamples.

“…They visualize the model space together with the data to reveal the trends in the data. Gosink et al. [13] use a query-driven visualization with a statistics-based framework. They utilize query distributions to estimate trends and features.…”

Section: Related Work (mentioning)

confidence: 99%