SemLinker: automating big data integration for casual users

Alrehamy, Hassan H.; Walker, Coral

doi:10.1186/s40537-018-0123-x

Cited by 8 publications

(7 citation statements)

References 34 publications

“…During our analysis, we mapped who the authors of the papers references when using a definition for data lakes. We found that James Dixon was the first one to use the term lake in big data context, in a post in its blog in 2010 [20], and he is referenced by ten papers [4], [6], [17], [32], [38], [44], [62], [63], [67], [91]. The first author to reference Dixon's Concept in academic context was O'Leary [63], in a paper published in 2014.…”

Section: Results (mentioning)

confidence: 99%

“…Initial Accepted Scopus 108 53 papers: [1]- [3], [5], [9], [10], [13]- [19], [23]- [29], [31]- [33], [37], [40], [45], [49], [50], [57], [60]- [66], [68], [70], [71], [73], [76]- [78], [81]- [84], [88], [90], [91], [93]- [95] Springer 222 20 papers: [4], [6], [12], [21], [30], [36], [38], [39], [41]- [43], [47], [51], [53], [69], [74], [79], [85], [86], [92] Google Scholar 197 6 papers:...…”

Section: Source (mentioning)

confidence: 99%

A Mapping Study about Data Lakes: An Improved Definition and Possible Architectures

Couto, Borges, Ruiz et al. 2019

International Conferences on Software Engineering and Knowledge Engineering

In the past few years, data lakes emerged as a trending topic in big data technologies. Although literature presents different points of view related to its functionalities, it serves mainly to store a variety of data in a big data context. In this paper, we aim to identify and analyze data lake definitions and possible architectures. Our methodology was composed of a systematic literature mapping based on PRISMA, software engineering best practices to perform reviews, and Kappa method to assess results' quality. We performed the search in eight different electronic databases to achieve a wide variety of publishers in Computer Science. We first identified 662 papers matching our search criteria; after filtering, we selected 87 papers for review. We found that the term data lakes was first defined by James Dixon in 2010. We also found that the term is often related to raw data repositories. From the identified definitions, we propose a new one as a means to better state what data lakes refer to and improve how the community use them. Moreover, we found that Hadoop and its ecosystem compose the most used toolset to create data lakes, revealing that this is the mainstream in architectures for data lakes as of today's available technologies.

“…Although Data Civilizer has a similar scope and objectives to VADA, typically users have a greater involvement with the individual data preparation steps, for example though mapping [39] or workflow [40] construction, so the emphasis is more on supporting developers in creating ETL flows than on the more fully automated approach being explored here. Building instead on semantic web technologies, SemLinker [41] extracts a graph of source data features, which are then aligned with a global ontology. Here the emphasis is on providing a consistent route into the data sets in a personal data lake, using plugins where necessary to provide more specialised processing for particular domains or data types.…”

Section: Discussion (mentioning)

confidence: 99%

VADA: an architecture for end user informed data preparation

et al. 2019

Background: Data scientists spend considerable amounts of time preparing data for analysis. Data preparation is labour intensive because the data scientist typically takes fine grained control over each aspect of each step in the process, motivating the development of techniques that seek to reduce this burden. Results: This paper presents an architecture in which the data scientist need only describe the intended outcome of the data preparation process, leaving the software to determine how best to bring about the outcome. Key wrangling decisions on matching, mapping generation, mapping selection, format transformation and data repair are taken by the system, and the user need only provide: (i) the schema of the data target; (ii) partial representative instance data aligned with the target; (iii) criteria to be prioritised when populating the target; and (iv) feedback on candidate results. To support this, the proposed architecture dynamically orchestrates a collection of loosely coupled wrangling components, in which the orchestration is declaratively specified and includes self-tuning of component parameters. Conclusion: This paper describes a data preparation architecture that has been designed to reduce the cost of data preparation through the provision of a central role for automation. An empirical evaluation with deep web and open government data investigates the quality and suitability of the wrangling result, the cost-effectiveness of the approach, the impact of self-tuning, and scalability with respect to the numbers of sources.

“…In this case, the accuracy of the algorithm is of concern [22,23]. [25,26]). In general, researchers are aware of the difficulty of detecting duplicates within incomplete data sets [12,18].…”

Section: Related Work (mentioning)

confidence: 99%

Missing values compensation in duplicates detection using hot deck method

2021

Duplicate record is a common problem within data sets especially in huge volume databases. The accuracy of duplicate detection determines the efficiency of duplicate removal process. However, duplicate detection has become more challenging due to the presence of missing values within the records where during the clustering and matching process, missing values can cause records deemed similar to be inserted into the wrong group, hence, leading to undetected duplicates. In this paper, duplicate detection improvement was proposed despite the presence of missing values within a data set through Duplicate Detection within the Incomplete Data set (DDID) method. The missing values were hypothetically added to the key attributes of three data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. The results were analyzed, then, the performance of duplicate detection was evaluated by using the Hot Deck method to compensate for the missing values in the key attributes. It was hypothesized that by using Hot Deck, duplicate detection performance would be improved. Furthermore, the DDID performance was compared to an early duplicate detection method namely DuDe, in terms of its accuracy and speed. The findings yielded that even though the data sets were incomplete, DDID was able to offer a better accuracy and faster duplicate detection as compared to DuDe. The results of this study offer insights into constraints of duplicate detection within incomplete data sets.
