Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data

Masseroli, Marco; Canakoglu, Arif; Pinoli, Pietro; Kaitoua, Abdulrahman; Gulino, Andrea; Horlova, Olha; Nanni, Luca; Bernasconi, Anna; Perna, Stefano; Stamoulakatou, Eirini; Ceri, Stefano

doi:10.1093/bioinformatics/bty688

Cited by 51 publications

(60 citation statements)

References 40 publications

“…We built an exploration mechanism for supporting semantic queries upon our Genomic Knowledge Graph; we demonstrated the effectiveness of our approach through four examples which are representative of the use of our query interface. Our repository is already storing data coming from eight data sources of genomic data, including datasets relevant for epigenomics, gene expression data, mutation data, deployed in conjunction with an advanced genomic data manager [9] (available at http://gmql.eu/gmql-rest/).…”

Section: Discussion (mentioning)

confidence: 99%

From a Conceptual Model to a Knowledge Graph for Genomic Datasets

Bernasconi

Canakoglu

Ceri

2019

Lecture Notes in Computer Science

Self Cite

Data access at genomic repositories is problematic, as data is described by heterogeneous and hardly comparable metadata. We previously introduced a unified conceptual schema, collected metadata in a single repository and provided classical search methods upon them. We here propose a new paradigm to support semantic search of integrated genomic metadata, based on the Genomic Knowledge Graph, a semantic graph of genomic terms and concepts, which combines the original information provided by each source with curated terminological content from specialized ontologies. Commercial knowledge-assisted search is designed for transparently supporting keyword-based search without explaining inferences; in biology, inference understanding is instead critical. For this reason, we propose a graph-based visual search for data exploration; some expert users can navigate the semantic graph along the conceptual schema, enriched with simple forms of homonyms and term hierarchies, thus understanding the semantic reasoning behind query results.

“…The latter is a cloud-based data manager for region-based data, supporting a new query language for genomics, called GenoMetric Query Language, GMQL [15]. The language derives from classical abstractions of relational databases and is the composition of orthogonal operations, which apply to either one or two datasets.…”

Section: Geco Resources (mentioning)

confidence: 99%

“…The associated GMQL query system [15] has a modular architecture including an intermediate representation supporting operations over regions and metadata which are executed by the Apache Spark engine, a data framework on the cloud that proved to be extremely efficient in supporting massive genomic queries [16], with a high-level technology-independent repository abstraction, supporting different repository types (e.g., local file system, Hadoop File System, or others), several system interfaces, including an intuitive public Web-based interface, as well as two programmatic interfaces: a pyGMQL library for Python 3 and an RGMQL package for the R/Bioconductor environment.…”

Section: Geco Resources (mentioning)

confidence: 99%

Data Science for Genomic Data Management: Challenges, Resources, Experiences

Ceri

Pinoli

2019

SN COMPUT. SCI.

Self Cite

We highlight several challenges which are faced by data scientists who use public datasets for solving biological and clinical problems. In spite of the large efforts in building such public datasets, they are dispersed over many sources and heterogeneous for their formats and sequencing/calling techniques, often meeting highly variable quality standards. Moreover, for most research questions, scientists hardly find datasets with enough samples for building and training machine learning models. Data scarcity depends on the complexity of the genomic domain, with its multi-facets, as well as the lack of organic initiatives to provide standardization across communities. In this paper, we discuss our approach to genomic data management, that can strongly improve the problems of data dispersion and format heterogeneity through high-level abstractions for genomics. We briefly present the computational resources that were recently developed by the GeCo project (ERC Advanced Grant); they include GDM, a Genomic Data Model providing interoperability across data formats; GMQL, a genometric query language for answering data science queries over genomic datasets; and an in-house integrated repository providing attribute-based and keyword-based search over normalized metadata from several open data repositories. We describe these resources at work on a specific research question, and we highlight how we managed to produce a model for addressing such specific research question by overcoming the lack of sufficient samples and labelled datasets.

“…We downloaded the 33 ENCODE CTCF Narrow Peak tracks (Table S1) from the UCSC Browser. For each CTCF binding site we then associate its enrichment signal for each of the ChIP-seq tracks (using the map operation of PyGMQL (Masseroli et al, 2019)). Before aggregating the 33 signal values for every CTCF binding site, we assessed the value distribution of every CTCF ChIP-seq experiment and found heterogeneous distributions across cell lines, lineages and laboratories.…”

Section: Assigning Scores To CTCF Binding Sites (mentioning)

confidence: 99%

The CTCF Anatomy of Topologically Associating Domains

Nanni

Wang

Manders

et al. 2019

Preprint

Self Cite

Topologically associated domains (TADs) are defined as regions of self-interaction. To date, it is unclear how to reconcile TAD structure with CTCF site orientation, which is known to coordinate chromatin loops anchored by Cohesin rings at convergent CTCF site pairs. We first approached this problem by 4C analysis of the FKBP5 locus. This uncovered a CTCF loop encompassing FKBP5 but not its entire TAD. However, adjacent CTCF sites were able to form 'back-up' loops and these were located at TAD boundaries. We then analysed the spatial distribution of CTCF patterns along the genome together with a boundary identity conservation 'gradient' obtained from primary blood cells. This revealed that divergent CTCF sites are enriched at boundaries and that convergent CTCF sites mark the interior of TADs. This conciliation of CTCF site orientation and TAD structure has deep implications for the further study and engineering of TADs and their boundaries.
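The citation statement from this paper uses a "map" step: each reference region (a CTCF binding site) is paired with the signal of the experiment regions (ChIP-seq peaks) that overlap it. The sketch below illustrates that idea in plain Python only; it is not the PyGMQL API. The `Region` type, the `map_signal` helper, the half-open coordinate convention, the choice of the mean as aggregate, and the toy data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Region:
    """Hypothetical genomic region; coordinates assumed half-open [start, end)."""
    chrom: str
    start: int
    end: int
    signal: float = 0.0


def overlaps(a: Region, b: Region) -> bool:
    """True when both regions lie on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def map_signal(reference: List[Region], track: List[Region]) -> List[Optional[float]]:
    """For each reference region, average the signal of overlapping track regions.

    Returns None where no track region overlaps. The mean is an assumed
    aggregate; any other aggregate function could be substituted.
    """
    results: List[Optional[float]] = []
    for ref in reference:
        hits = [peak.signal for peak in track if overlaps(ref, peak)]
        results.append(mean(hits) if hits else None)
    return results


# Toy data (invented): three binding sites and two peak tracks.
sites = [
    Region("chr1", 100, 200),
    Region("chr1", 500, 600),
    Region("chr2", 100, 200),
]
tracks: Dict[str, List[Region]] = {
    "track_a": [Region("chr1", 150, 250, 4.0), Region("chr1", 180, 220, 6.0)],
    "track_b": [Region("chr1", 550, 650, 2.0), Region("chr2", 50, 120, 8.0)],
}

# One score per (track, site) pair, ready for later aggregation across tracks.
scores = {name: map_signal(sites, peaks) for name, peaks in tracks.items()}
```

This version scans every peak for every site, which is quadratic and only suitable for small inputs; a system built for large datasets would presumably sort or bin regions by chromosome and position first.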

Abstract: The GMQL system is freely available for non-commercial use as open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/.
