Abstract - There is currently a surge of Big Data (BD) being processed and stored in huge raw data repositories, commonly called Data Lakes (DL). Such BD require new data integration and schema alignment techniques to make the data usable by their consumers and to discover the relationships linking their content. This can be provided by metadata services that discover and describe the content. However, there is currently no systematic approach to this kind of metadata discovery and management. Thus, we propose a framework for the profiling of the informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to handle this effectively. We demonstrate alternative techniques and the performance of our process using a prototype implementation on a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach.
I. INTRODUCTION

There is currently a huge growth in the volume, variety, and velocity of data ingested into analytical data repositories. Such data are commonly called Big Data (BD). Data repositories storing such BD in its original raw format are commonly called Data Lakes (DL) [1]. DL are characterised by large amounts of data covering different subjects, which need to be analysed by non-experts in IT, commonly called data enthusiasts [2]. To support the data enthusiast in analysing the data in the DL, there must be a data governance process which describes the content using metadata. Such a process should describe the informational content of the ingested data using the least intrusive techniques. The metadata can then be exploited by the data enthusiast to discover relationships between datasets, duplicated data, and outliers which have no other datasets related to them, as illustrated by the sketch at the end of this section.

In this paper, we investigate the process and techniques required to manage the metadata about the informational content of the DL. We specifically focus on addressing the challenges of variety and variability of the BD ingested in the DL. The discovered metadata supports data consumers in finding the required data within the large amounts of information stored inside the DL for analytical purposes [3]. Currently, information discovery to identify, locate, integrate, and reengineer data consumes 70% of the time spent in a data analytics project [1], which clearly needs to be reduced. To handle this challenge, this paper proposes (i) a systematic process for the schema annotation of data ingested in the DL and
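To make the idea of an information profile concrete, the following minimal Python sketch illustrates one plausible realisation; all names and structures here are our own illustrative assumptions, not the paper's actual implementation. It summarises each dataset as a bag of terms drawn from its attribute names and sampled values, and compares profiles with Jaccard similarity: high overlap suggests related or duplicated datasets, while a dataset whose best match stays near zero is a candidate outlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InformationProfile:
    """Hypothetical content summary of one dataset, stored as DL metadata."""
    dataset_id: str
    vocabulary: frozenset  # distinct lower-cased terms from names and values

def build_profile(dataset_id, columns):
    """Profile a dataset given a mapping of column name -> sampled values."""
    terms = set()
    for name, values in columns.items():
        terms.add(name.lower())
        terms.update(str(v).lower() for v in values)
    return InformationProfile(dataset_id, frozenset(terms))

def content_overlap(p, q):
    """Jaccard similarity of two profiles, in [0, 1]."""
    union = p.vocabulary | q.vocabulary
    return len(p.vocabulary & q.vocabulary) / len(union) if union else 0.0

# Illustrative use: near-zero overlap flags a dataset with no related
# content in the DL; overlap near 1 flags a likely duplicate.
iris = build_profile("iris", {"species": ["setosa", "virginica"]})
cars = build_profile("cars", {"make": ["ford", "audi"]})
print(content_overlap(iris, cars))  # 0.0: candidate outlier pairing
```

Such term-based profiles are only one of several possible techniques; the process defined later in the paper is agnostic to the concrete profiling method used.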