F1 query

Samwel, Bart; Cieslewicz, John; Handy, Ben; Govig, Jason; Venetis, Petros; Yang, Chanjun; Peters, Keith; Shute, Jeff; Tenedorio, Daniel; Apte, Himani; Weigel, Felix; Wilhite, David; Yang, Jiacheng; Xu, Jun; Li, Jiexing; Yuan, Zhan; Chasseur, Craig; Zeng, Qiang; Rae, Ian; Biyani, Anurag; Harn, Andrew; Yang, Xiao; Gubichev, Andrey; El-Helw, Amr; Erling, Orri; Yan, Zhepeng; Yang, Mohan; Wei, Yiqun; Nho, Thanh; Zheng, Colin; Graefe, Goetz; Sardashti, Somayeh; Aly, A. A.; Agrawal, Divy; Gupta, Ashish; Venkataraman, Shiv

doi:10.14778/3229863.3229871

Cited by 33 publications

(9 citation statements)

References 50 publications


“…With input and output sizes many times larger than the available memory, the cost of the necessary join lookup and fetch for the materialization depends on the cost of random I/Os [6,12]. Local NVM and SSD storage could provide efficient random reads; in our environment, however, storage is disaggregated and handled by servers separate from the ones executing the query logic [29]. The cost of an I/O is a network round trip, plus the invocation of the storage service, plus an I/O in a shared and busy disk drive.…”

Section: Top-k Execution Strategies (mentioning)

confidence: 99%

“…Externally sorting the entire input is an expensive operation and results in unpleasant user experience as the execution of a top-k query exhibits a performance cliff; namely the sudden and drastic change in the execution cost when the output exceeds the memory capacity. An analysis of our production query logs showed that, on an average day, F1 Query [29] executes tens of thousands of top-k queries that resort to an external sort of the entire input. We observe that it is very common for top-k queries to use secondary storage, due to high contention for main memory resources or simply because of large requested outputs.…”

Section: Introduction (mentioning)

confidence: 99%


External Merge Sort for Top-K Queries

Chronis, Nho, Graefe, et al. 2020

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Self Cite


Business intelligence and web log analysis workloads often use queries with top-k clauses to produce the most relevant results. Values of k range from small to rather large, and sometimes the requested output exceeds the capacity of the available main memory. When the requested output fits in the available memory, existing top-k algorithms are efficient, as they can eliminate almost all but the top k results before sorting them. When the requested output exceeds the main memory capacity, existing algorithms externally sort the entire input, which can be very expensive. Furthermore, the drastic difference in execution cost when the memory capacity is exceeded results in an unpleasant user experience. Every day, tens of thousands of production top-k queries executed on F1 Query resort to an external sort of the input. To address these challenges, we introduce a new top-k algorithm that is able to eliminate parts of the input before sorting or writing them to secondary storage, regardless of whether the requested output fits in the available memory. To achieve this, at execution time our algorithm creates a concise model of the input using histograms. The proposed algorithm is implemented as part of F1 Query and is used in production, where it significantly accelerates top-k queries with outputs larger than the available memory. We evaluate our algorithm against existing top-k algorithms and show that it reduces I/O traffic and can be up to 11× faster. * Work done while at Google Inc.
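The idea in the abstract above, dropping input rows before they are sorted or spilled by consulting a histogram built at execution time, can be illustrated with a small sketch. This is not the paper's algorithm or F1 Query code: the function name `top_k_smallest`, the parameters `run_size` and `bucket_size`, and the use of in-memory lists in place of secondary storage are all assumptions made for illustration. Each spilled run contributes histogram buckets; once the buckets prove that at least k spilled rows are no larger than some boundary, that boundary becomes a cutoff and later rows above it are discarded on arrival.

```python
import heapq
from itertools import islice


def top_k_smallest(rows, k, run_size, bucket_size):
    """Return (k smallest rows, number of rows dropped before spilling).

    Illustrative sketch only: `runs` stands in for sorted runs on
    secondary storage, and all names here are hypothetical.
    """
    runs = []       # sorted runs (a real system would write these out)
    buckets = []    # (boundary, count): `count` spilled rows are <= boundary
    cutoff = None   # rows strictly above this value cannot be in the output
    buffer = []
    dropped = 0

    def flush():
        nonlocal cutoff
        if not buffer:
            return
        run = sorted(buffer)[:k]  # a single run never needs more than k rows
        buffer.clear()
        runs.append(run)
        # Summarize the run as histogram buckets of up to bucket_size rows.
        for start in range(0, len(run), bucket_size):
            chunk = run[start:start + bucket_size]
            buckets.append((chunk[-1], len(chunk)))
        buckets.sort()
        # The smallest boundary covering at least k spilled rows is a safe cutoff.
        seen = 0
        for boundary, count in buckets:
            seen += count
            if seen >= k:
                cutoff = boundary
                break

    for row in rows:
        if cutoff is not None and row > cutoff:
            dropped += 1          # eliminated before sorting or spilling
            continue
        buffer.append(row)
        if len(buffer) >= run_size:
            flush()
    flush()

    # Merge the surviving runs and keep only the first k rows.
    return list(islice(heapq.merge(*runs), k)), dropped
```

Choosing `run_size` smaller than `k` mimics the case the abstract targets, where the requested output does not fit in memory; the cutoff still tightens as more runs are summarized, so less data reaches the final merge.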


“…A popular way to avoid expensive remote join operations, already used in early parallel systems, is to co-partition tables on their join key [22,26]. Generalizations of the latter technique where co-partitioning is determined by more complex join predicates have been shown to be effective in modern systems as well [38,39,41,45].…”

Section: Co-partitioning (mentioning)

confidence: 99%

“…Recent database systems like Google's F1 [39,41] use hierarchical partitioning schemes to provide performance while ensuring consistency under updates. Hierarchical partitioning is a variant of the co-partitioning approach [42], introduced as predicate-based reference partitioning [45].…”

Section: Hierarchical Partitioning Schemes (mentioning)

confidence: 99%

“…These placement strategies often require a reshuffling of the data for each binary join in the processed query which are commonly based on a range or hash partitioning of the relevant attributes. Recently, however, more elaborated schemes of data placement like co-partitioning, single hypercubes (for multiway joins) or multiple hypercubes (for skewed data) gained some attention [3,13,30,39,41,45].…”

Section: Introduction (mentioning)

confidence: 99%


Distribution Constraints: The Chase for Distributed Data

Geck, Neven, Schwentick 2020

Preprint


This paper introduces a declarative framework to specify and reason about distributions of data over computing nodes in a distributed setting. More specifically, it proposes distribution constraints which are tuple and equality generating dependencies (tgds and egds) extended with node variables ranging over computing nodes. In particular, they can express co-partitioning constraints and constraints about range-based data distributions by using comparison atoms. The main technical contribution is the study of the implication problem of distribution constraints. While implication is undecidable in general, relevant fragments of so-called data-full constraints are exhibited for which the corresponding implication problems are complete for EXPTIME, PSPACE and NP. These results yield bounds on deciding parallel-correctness for conjunctive queries in the presence of distribution constraints.
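The co-partitioning that the citation statements and this abstract refer to can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not code from any of the cited systems: the helper names `partition`, `local_join`, and `co_partitioned_join` are hypothetical, nodes are simulated as dictionary entries, and plain hash partitioning on the join key stands in for whatever placement scheme a real system uses. Because both tables are placed by the same function of the join key, matching rows always land on the same node and each node can join its own partitions without remote lookups.

```python
from collections import defaultdict


def partition(rows, key_index, num_nodes):
    """Assign each row to a simulated node by hashing its join key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row[key_index]) % num_nodes].append(row)
    return nodes


def local_join(left_part, right_part, left_key, right_key):
    """Hash join of two partitions that live on the same node."""
    index = defaultdict(list)
    for right_row in right_part:
        index[right_row[right_key]].append(right_row)
    return [
        left_row + right_row
        for left_row in left_part
        for right_row in index.get(left_row[left_key], [])
    ]


def co_partitioned_join(left, right, left_key, right_key, num_nodes):
    """Join two tables using only node-local joins.

    Both inputs are partitioned with the same hash of the join key,
    so no row needs to be compared with a row on another node.
    """
    left_nodes = partition(left, left_key, num_nodes)
    right_nodes = partition(right, right_key, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(
            local_join(left_nodes[node], right_nodes[node], left_key, right_key)
        )
    return result
```

For instance, joining a customers table on its first column with an orders table on its second column yields the same rows as a single-node join, only grouped by node.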


A Roadmap for HEP Software and Computing R&D for the 2020s

Albrecht, Alves, Amádio, et al. 2019

Comput Softw Big Sci

138


Particle physics has an ambitious and broad experimental programme for the coming decades. This programme requires large investments in detector hardware, either to build new facilities and experiments, or to upgrade existing ones. Similarly, it requires commensurate investment in the R&D of software to acquire, manage, process, and analyse the sheer amounts of data to be recorded. In planning for the HL-LHC in particular, it is critical that all of the collaborating stakeholders agree on the software goals and priorities, and that the efforts complement each other. In this spirit, this white paper describes the R&D activities required to prepare for this software upgrade.
