Optimal Binning for Genomics

Gulino, Andrea; Kaitoua, Abdulrahman; Ceri, Stefano

doi:10.1109/tc.2018.2854880

Cited by 4 publications

(4 citation statements)

References 43 publications

“…Defining an optimal execution strategy is a classic distributed database problem, which can be solved analytically or by learning the best strategies after several executions. A quantitative model for evaluating the cost of binary operations was already developed in [20]. The three strategies (distributed, centralized, or externalized) discussed in Section 4.2 can be used as starting points for an effective exploration of alternatives.…”

Section: Discussion (mentioning)

confidence: 99%

“…Specifically, interoperability of region data is guaranteed by the adoption of the Genomic Data Model in GMQL [10], while interoperability of metadata descends from the use of a common conceptual model for genomic sources [18], resulting in the GenoSurf repository [11]. Effective parallel execution of GMQL queries is guaranteed by several physical optimizations, including [19] and [20].…”

Section: Federated Genomic Data Management System Comparison (mentioning)

confidence: 99%

Federated sharing and processing of genomic datasets for tertiary data analysis

Canakoglu, Pinoli, Gulino et al., 2020

Briefings in Bioinformatics

Self Cite

Abstract

Motivation: With the spreading of biological and clinical uses of next-generation sequencing (NGS) data, many laboratories and health organizations are facing the need of sharing NGS data resources and easily accessing and processing comprehensively shared genomic data; in most cases, primary and secondary data management of NGS data is done at sequencing stations, and sharing applies to processed data. Based on the previous single-instance GMQL system architecture, here we review the model, language and architectural extensions that make the GMQL centralized system innovatively open to federated computing.

Results: A well-designed extension of a centralized system architecture to support federated data sharing and query processing. Data is federated thanks to simple data sharing instructions. Queries are assigned to execution nodes; they are translated into an intermediate representation, whose computation drives data and processing distributions. The approach allows writing federated applications according to classical styles: centralized, distributed or externalized.

Availability: The federated genomic data management system is freely available for non-commercial use as an open source project at http://www.bioinformatics.deib.polimi.it/FederatedGMQLsystem/

Contact: {arif.canakoglu, pietro.pinoli}@polimi.it

“…GMQL is optimized for sample databases containing many samples, with each sample having a large BED file (Neph et al, 2012) containing tens of thousands to hundreds of thousands of genomic regions. GMQL achieves high performance by binning the genome into chunks and comparing different bins concurrently (Gulino et al, 2018).…”

Section: A Stress Test (mentioning)

confidence: 99%

Iterating on multiple collections in synchrony

Perna, Tannen, Wong, 2022

J. Funct. Prog.

Modern programming languages typically provide some form of comprehension syntax which renders programs manipulating collection types more readable and understandable. However, comprehension syntax corresponds to nested loops in general. There is no simple way of using it to express efficient general synchronized iterations on multiple ordered collections, such as linear-time algorithms for low-selectivity database joins. Synchrony fold is proposed here as a novel characterization of synchronized iteration. Central to this characterization is a monotonic isBefore predicate for relating the orderings on the two collections being iterated on and an antimonotonic canSee predicate for identifying matching pairs in the two collections to synchronize and act on. A restriction is then placed on Synchrony fold, cutting its extensional expressive power to match that of comprehension syntax, giving us Synchrony generator. Synchrony generator retains sufficient intensional expressive power for expressing efficient synchronized iteration on ordered collections. In particular, it is proved to be a natural generalization of the database merge join algorithm, extending the latter to more general database joins. Finally, Synchrony iterator is derived from Synchrony generator as a novel form of iterator. While Synchrony iterator has the same extensional and intensional expressive power as Synchrony generator, the former is better dovetailed with comprehension syntax. Thereby, algorithms requiring synchronized iterations on multiple ordered collections, including those for efficient general database joins, become expressible naturally in comprehension syntax.
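The abstract above describes synchronized iteration driven by an isBefore predicate and a canSee predicate. The following Python sketch illustrates the general idea as a merge-join-style loop over two sorted lists. It is a simplified illustration, not the paper's definition of Synchrony fold: the function name `synchronized_join` and the exact skipping rules are assumptions made for this example. It assumes both lists are sorted consistently with `is_before`, that an x lying before y and unable to see it cannot see any later y, and that an x not before y and unable to see it implies no later x can see y either.

```python
def synchronized_join(xs, ys, is_before, can_see):
    """Pair every y with the xs it can see, scanning both sorted lists in step.

    Hypothetical helper for illustration; relies on the ordering
    assumptions stated in the surrounding text.
    """
    result = []
    lo = 0  # index of the first x that may still match the current or a later y
    for y in ys:
        # Permanently drop leading xs that are before y and cannot see it.
        while lo < len(xs) and is_before(xs[lo], y) and not can_see(xs[lo], y):
            lo += 1
        i = lo
        while i < len(xs):
            x = xs[i]
            if can_see(x, y):
                result.append((x, y))
            elif not is_before(x, y):
                # x is past y and invisible to it, so later xs are too.
                break
            i += 1
    return result


# Example: overlap join on half-open intervals, both lists sorted by start.
def starts_before(x, y):
    return x[0] < y[0]


def overlaps(x, y):
    return x[0] < y[1] and y[0] < x[1]


xs = [(1, 10), (2, 3), (5, 6), (12, 14)]
ys = [(4, 7), (9, 13)]
pairs = synchronized_join(xs, ys, starts_before, overlaps)
```

With equality as `can_see` and `<` as `is_before`, the same loop behaves like a merge equi-join on sorted keys, which matches the abstract's framing of synchronized iteration as a generalization of merge join.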

“…For scalability, join algorithms use binning [11], a partitioning of the genome into segments of equal size, such that each bin is processed in parallel. Optimal binning strategies, discussed in [12], highly improve the join performance, but the scale up is limited due to intrinsic synchronization requirements of the method: contiguous bins may produce replicated regions in the results, their pruning induces a need for data shuffling, and at some point the data shuffling overhead becomes predominant.…”

Section: Introduction (mentioning)

confidence: 99%
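The citation statement above describes binning as a partition of the genome into equal-size segments, with regions that span contiguous bins producing replicated results that must be pruned. A minimal Python sketch of that idea follows. It assumes half-open `(start, end)` intervals on a single chromosome, and the function names are hypothetical. The pruning rule used here (report a pair only in the bin that contains the start of the intersection) is one common technique and is not necessarily the rule used in the cited work.

```python
from collections import defaultdict


def bins_for(start, end, bin_size):
    """Indices of all bins touched by the half-open interval [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


def binned_overlap_join(left, right, bin_size):
    """Return each overlapping pair (a, b) exactly once, in sorted order."""
    # Replicate every region into each bin it touches.
    left_bins, right_bins = defaultdict(list), defaultdict(list)
    for a in left:
        for k in bins_for(a[0], a[1], bin_size):
            left_bins[k].append(a)
    for b in right:
        for k in bins_for(b[0], b[1], bin_size):
            right_bins[k].append(b)

    result = []
    # Bins are independent of each other, so this loop could run in parallel.
    for k in left_bins.keys() & right_bins.keys():
        for a in left_bins[k]:
            for b in right_bins[k]:
                if a[0] < b[1] and b[0] < a[1]:
                    # Prune replicas: keep the pair only in the bin that
                    # holds the start of the intersection.
                    if max(a[0], b[0]) // bin_size == k:
                        result.append((a, b))
    return sorted(result)


left = [(5, 25), (30, 35)]
right = [(8, 22), (24, 31), (40, 50)]
pairs = binned_overlap_join(left, right, bin_size=10)
print(len(pairs))  # 3
```

In this sketch the pruning is decided locally inside each bin. In a distributed setting the replication step itself still moves each region to every bin it touches, which is the kind of overhead the statement refers to.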

Array-based Data Management for Genomics

Horlova, Kaitoua, Ceri, 2020

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Self Cite

With the huge growth of genomic data, exposing multiple heterogeneous features of genomic regions for millions of individuals, we increasingly need to support domain-specific query languages and knowledge extraction operations, capable of aggregating and comparing trillions of regions arbitrarily positioned on the human genome. While row-based models for regions can be effectively used as a basis for cloud-based implementations, in previous work we have shown that the array-based model is effective in supporting the class of region-preserving operations, i.e. operations which do not create any new region but rather compose existing ones. In this paper, we remove the above constraint, and describe an array-based implementation which applies to unrestricted region operations, as required by the Genometric Query Language. Specifically, we define a wide spectrum of operations over datasets which are represented using arrays, and we show that the array-based implementation scales well upon Spark, also thanks to a data representation which is effectively used for supporting machine learning. Our benchmark, which uses an independent, pre-existing collection of queries, shows that in many cases the novel array-based implementation significantly improves the performance of the row-based implementation.
