FastQuery: A Parallel Indexing System for Scientific Data

Chou, Jerry; Wu, Kesheng; Prabhat

doi:10.1109/cluster.2011.86

Cited by 45 publications

(30 citation statements)

References 21 publications


“…In contrast, PARLO addresses heterogeneous access patterns induced by a range of general query types. Unlike prior post-processing approaches [5], [25], PARLO is integrated with parallel I/O middleware to achieve efficient run-time in-memory layout optimization and index building.…”

Section: Run-time Layout Optimization Performance Evaluation (mentioning)

confidence: 99%


PARLO: PArallel Run-Time Layout Optimization for Scientific Data Explorations with Heterogeneous Access Patterns

Gong, Boyuka, Zou, et al. 2013

2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing


The size and scope of cutting-edge scientific simulations are growing much faster than the I/O and storage capabilities of their run-time environments. The growing gap is exacerbated by exploratory, data-intensive analytics, such as querying simulation data with multivariate, spatio-temporal constraints, which induces heterogeneous access patterns that stress the performance of the underlying storage system. Previous work addresses data layout and indexing techniques to improve query performance for a single access pattern, which is not sufficient for complex analytics jobs. We present PARLO, a parallel run-time layout optimization framework, to achieve multi-level data layout optimization for scientific applications at run-time before data is written to storage. The layout schemes optimize for heterogeneous access patterns with user-specified priorities. PARLO is integrated with ADIOS, a high-performance parallel I/O middleware for large-scale HPC applications, to achieve user-transparent, light-weight layout optimization for scientific datasets. It offers simple XML-based configuration for users to achieve flexible layout optimization without the need to modify or recompile application codes. Experiments show that PARLO improves performance by 2 to 26 times for queries with heterogeneous access patterns compared to state-of-the-art scientific database management systems. Compared to traditional post-processing approaches, its underlying run-time layout optimization achieves a 56% savings in processing time and a reduction in storage overhead of up to 50%. PARLO also exhibits a low run-time resource requirement while limiting the performance impact on running applications to a reasonable level.


“…For instance, SciDB [2] and work on space-filling curves (SFC) [17] focus on spatial LO. Likewise, FastBit [5], [25] and ISABELA-QA [16] explore value-based LO methods. However, systems optimized for only a single access pattern cannot address the mix of access patterns observed in practice.…”

Section: Introduction (mentioning)

confidence: 99%

PARLO: PArallel Run-Time Layout Optimization for Scientific Data Explorations with Heterogeneous Access Patterns

Gong, Boyuka, Zou, et al. 2013

2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing


“…In this work, we use FastQuery [8], [6], [7] to accelerate the data analysis process of the trillion particle dataset. Here, we briefly recap the salient features of FastQuery, and elaborate on the new hybrid parallel implementation.…”

Section: B Indexing/querying With Hybrid Parallel FastQuery (mentioning)

confidence: 99%

Parallel I/O, analysis, and visualization of a trillion particle simulation

Byna, Chou, Rübel, et al. 2012

2012 International Conference for High Performance Computing, Networking, Storage and Analysis

Self Cite


Petascale plasma physics simulations have recently entered the regime of simulating trillions of particles. These unprecedented simulations generate massive amounts of data, posing significant challenges in storage, analysis, and visualization. In this paper, we present parallel I/O, analysis, and visualization results from a VPIC trillion particle simulation running on 120,000 cores, which produces ∼30 TB of data for a single timestep. We demonstrate the successful application of H5Part, a particle data extension of parallel HDF5, for writing the dataset at a significant fraction of system peak I/O rates. To enable efficient analysis, we develop hybrid parallel FastQuery to index and query data using multi-core CPUs on distributed memory hardware. We show good scalability results for the FastQuery implementation using up to 10,000 cores. Finally, we apply this indexing/query-driven approach to facilitate the first-ever analysis and visualization of the trillion particle dataset.


“…Supposing a floating-point number with a very extensive domain, the use of inequalities can be more suitable to generate indexes (less number of bitmap columns) and query certain intervals (of values) of the attribute indexed (search space limited by the generated indexes). FastBit tool, FastQuery [29] and SDS/Q framework are examples of related work that employ bitmap indexing. However, none of these solutions can manage data element through dataflow generation.…”

Section: Indexing Raw Data From Files (mentioning)

confidence: 99%

Raw data queries during data-intensive parallel workflow execution

Silva, Leite, Camata, et al. 2017

Future Generation Computer Systems


Computer simulations consume and produce huge amounts of raw data files presented in different formats, e.g., HDF5 in computational fluid dynamics simulations. Users often need to analyze domain-specific data based on related data elements from multiple files during the execution of computer simulations. In a raw data analysis, one should identify regions of interest in the data space and retrieve the content of specific related raw data files. Existing solutions, such as FastBit and RAW, are limited to a single raw data file analysis and can only be used after the execution of computer simulations. Scientific Workflow Management Systems (SWMS) can manage the dataflow of computer simulations and register related raw data files at a provenance database. This paper aims to combine the advantages of a dataflow-aware SWMS and the raw data file analysis techniques to allow for queries on raw data file elements that are related, but reside in separate files. We propose a component-based architecture, named as ARMFUL (Analysis of Raw data from Multiple Files) with raw data extraction and indexing techniques, which allows for a direct access to specific elements or regions of raw data space. ARMFUL innovates by using a SWMS provenance database to add a dataflow access path to raw data files. ARMFUL facilitates the invocation of ad-hoc programs and third party tools (e.g., FastBit tool) for raw data analyses. In our experiments, a real parallel computational fluid dynamics is executed, exploring different alternatives of raw data extraction, indexing and analysis.
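The citation statement quoted above from Silva et al. notes that inequality-based encoding keeps the number of bitmap columns small and lets a query be restricted to an interval of attribute values. A minimal Python sketch of that idea follows. It is not the FastBit or FastQuery API: the function names (`build_range_bitmaps`, `query_interval`, `rows`), the sample values, and the bin edges are all hypothetical, and plain Python integers stand in for the compressed bitmaps a production tool would likely use.

```python
def build_range_bitmaps(values, edges):
    """Build one bitmap per bin edge e; bit i is set when values[i] <= e.

    Hypothetical helper for illustration. Each bitmap is a Python int
    used as a bit vector, one bit per row.
    """
    bitmaps = []
    for edge in edges:
        bits = 0
        for i, v in enumerate(values):
            if v <= edge:
                bits |= 1 << i
        bitmaps.append(bits)
    return bitmaps


def query_interval(bitmaps, edges, lo, hi):
    """Return a bitmap of rows with lo < value <= hi.

    Assumes lo and hi are both existing bin edges, so the answer needs
    only two bitmaps: (value <= hi) AND NOT (value <= lo).
    """
    upper = bitmaps[edges.index(hi)]
    lower = bitmaps[edges.index(lo)]
    return upper & ~lower


def rows(bits):
    """List the row numbers whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Usage with made-up data: six floating-point values, four bin edges.
values = [0.5, 2.3, 7.1, 3.3, 9.8, 1.0]
edges = [1.0, 4.0, 8.0, 10.0]
bitmaps = build_range_bitmaps(values, edges)
print(rows(query_interval(bitmaps, edges, 1.0, 4.0)))  # rows 1 and 3 (values 2.3 and 3.3)
```

The index here has one bitmap per bin edge rather than one per distinct value, which is the "fewer bitmap columns" point in the quoted statement. A query whose bounds do not fall on bin edges would additionally need to check the raw values in the boundary bins, which this sketch omits.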

