2019
DOI: 10.1109/tc.2018.2854880

Optimal Binning for Genomics

Abstract: Genome sequencing is expected to be the most prolific source of big data in the next decade; millions of whole genome datasets will open new opportunities for biological research and personalized medicine. Genome sequences are abstracted in the form of interesting regions, describing abnormalities of the genome. The parallel execution on the cloud of complex operations for joining and mapping billions of genomic regions is increasingly important. Genome binning, i.e. partitioning of the genome into small-size …

Cited by 4 publications (4 citation statements)
References 43 publications
“…Defining an optimal execution strategy is a classic distributed database problem, which can be solved analytically or by learning the best strategies after several executions. A quantitative model for evaluating the cost of binary operations was already developed in [20]. The three strategies (distributed, centralized, or externalized) discussed in Section 4.2 can be used as starting points for an effective exploration of alternatives.…”
Section: Discussion (mentioning)
confidence: 99%
“…Specifically, interoperability of region data is guaranteed by the adoption of the Genomic Data Model in GMQL [10], while interoperability of metadata descends from the use of a common conceptual model for genomic sources [18], resulting in the GenoSurf repository [11]. Effective parallel execution of GMQL queries is guaranteed by several physical optimizations, including [19] and [20].…”
Section: Federated Genomic Data Management System Comparison (mentioning)
confidence: 99%
“…GMQL is optimized for sample databases containing many samples, with each sample having a large BED file (Neph et al, 2012) containing tens of thousands to hundreds of thousands of genomic regions. GMQL achieves high performance by binning the genome into chunks and comparing different bins concurrently (Gulino et al, 2018).…”
Section: A Stress Test (mentioning)
confidence: 99%
“…For scalability, join algorithms use binning [11], a partitioning of the genome into segments of equal size, such that each bin is processed in parallel. Optimal binning strategies, discussed in [12], highly improve the join performance, but the scale up is limited due to intrinsic synchronization requirements of the method: contiguous bins may produce replicated regions in the results, their pruning induces a need for data shuffling, and at some point the data shuffling overhead becomes predominant.…”
Section: Introduction (mentioning)
confidence: 99%
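
The last statement above describes binning as a partitioning of the genome into equal-size segments, with replicated results arising across contiguous bins. A minimal Python sketch of that idea follows. It is an illustration only, not the method of the cited papers or of GMQL: the function names, the half-open (chrom, start, end) region format, the non-empty-region assumption (end > start), and the set-based pruning step are all assumptions made for this example.

```python
from collections import defaultdict


def bins_of(start, end, bin_size):
    """Indices of the fixed-size bins touched by the half-open region [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


def binned_overlap_pairs(left, right, bin_size):
    """Overlap join of two lists of (chrom, start, end) regions, done bin by bin.

    A region spanning several bins is copied into each of them, so a pair
    that overlaps across contiguous bins is reported once per shared bin.
    """
    # (chrom, bin index) -> (regions from left, regions from right)
    buckets = defaultdict(lambda: ([], []))
    for side, regions in enumerate((left, right)):
        for chrom, start, end in regions:
            for b in bins_of(start, end, bin_size):
                buckets[(chrom, b)][side].append((chrom, start, end))

    pairs = []
    # Buckets are independent of one another, which is what would allow
    # them to be processed in parallel; here they are visited sequentially.
    for left_regions, right_regions in buckets.values():
        for l in left_regions:
            for r in right_regions:
                if max(l[1], r[1]) < min(l[2], r[2]):  # intervals overlap
                    pairs.append((l, r))
    return pairs


left = [("chr1", 50, 250)]
right = [("chr1", 90, 210), ("chr1", 300, 400)]

raw = binned_overlap_pairs(left, right, bin_size=100)
pruned = sorted(set(raw))  # pruning step: drop replicated pairs
```

With a bin size of 100, the two overlapping regions share three bins, so the same pair appears three times in `raw` and once in `pruned`. In a distributed setting this pruning step needs the replicas to be brought together, which corresponds to the data shuffling overhead mentioned in the statement.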