Data Accuracy is one of the main dimensions of Data Quality; it measures the degree to which data are correct. Knowing the accuracy of an organization's data indicates how much reliance can be placed on them in decision-making processes. Measuring data accuracy in a Big Data environment involves comparing the data to be assessed with "reference data" that the system considers correct. However, such a process can be complex or even impossible in the absence of appropriate reference data. In this paper, we focus on this problem and propose an approach that uses Big Data technologies to obtain the reference data. Our approach is based on the upstream selection of a set of criteria that we define as "Accuracy Criteria". We also use a set of techniques such as Big Data Sampling, Schema Matching, Record Linkage, and Similarity Measurement. The proposed model and experimental results give us more confidence in the importance of a data quality assessment solution and of configuring the accuracy criteria to automate the selection of reference data in a Data Lake.
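The comparison against reference data described above can be illustrated with a minimal sketch. It is not the authors' model: it assumes records are linked to reference records by an exact shared key, that similarity is measured with the standard-library `difflib.SequenceMatcher` ratio, and that a value counts as correct when its similarity reaches a threshold of 0.9. The function names `similarity` and `accuracy`, the threshold, and the sample fields are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two values, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def accuracy(records, reference, key, fields, threshold=0.9):
    """Share of compared field values that match the reference data.

    Assumption: record linkage is a plain exact match on `key`.
    Records with no reference counterpart are skipped, not counted as wrong.
    Returns None when nothing could be compared.
    """
    ref_by_key = {r[key]: r for r in reference}
    compared = 0
    correct = 0
    for rec in records:
        ref = ref_by_key.get(rec[key])
        if ref is None:
            continue  # no reference data available for this record
        for field in fields:
            compared += 1
            if similarity(str(rec[field]), str(ref[field])) >= threshold:
                correct += 1
    return correct / compared if compared else None


# Hypothetical sample: record 3 has no reference entry and is ignored.
records = [
    {"id": 1, "city": "Paris"},
    {"id": 2, "city": "Lyon"},
    {"id": 3, "city": "Nice"},
]
reference = [
    {"id": 1, "city": "paris"},
    {"id": 2, "city": "Marseille"},
]
score = accuracy(records, reference, key="id", fields=["city"])
```

Returning `None` when no record can be linked reflects the case raised in the abstract, where accuracy cannot be measured without appropriate reference data.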
Big Data often refers to a set of technologies dedicated to dealing with large volumes of data. Data Quality and Data Security are two essential aspects of any Big Data project. While Data Quality Management Systems put in place a set of processes to assess and improve certain characteristics of data such as Accuracy, Consistency, Completeness, and Timeliness, Security Systems are designed to protect the Confidentiality, Integrity, and Availability of data. In a Big Data environment, data quality processes can be blocked by data security mechanisms, because data is often collected from external sources that may impose their own security policies. Many research works have recognized that merging and integrating access control policies are real challenges for Big Data projects. To address this issue, we propose in this paper a framework to secure data collection in collaborative platforms. Our framework extends and combines two existing frameworks, namely PolyOrBAC and SLA-Framework. PolyOrBAC is a framework intended for the protection of collaborative environments. SLA-Framework, for its part, is an implementation of the WS-Agreement Specification, the standard for managing bilaterally negotiable SLAs (Service Level Agreements) in distributed systems; its integration into PolyOrBAC will automate the implementation and application of security rules. The resulting framework will then be incorporated into a data quality assessment system to create a secure and dynamic collaborative activity in the Big Data context.