Missing value (MV) is one form of data completeness problem in massive datasets. To deal with missing values, data imputation methods were proposed with the aim to improve the completeness of the datasets concerned. Data imputation's accuracy is a common indicator of a data imputation technique's efficiency. However, the efficiency of data imputation can be affected by the nature of the language in which the dataset is written. To overcome this problem, it is necessary to normalize the data, especially in non-Latin languages such as the Arabic language. This paper proposes a method that will address the challenge inherent in Arabic datasets by extending the enhanced robust association rules (ERAR) method with Arabic detection and correction functions. Iterative and Decision Tree methods were used to evaluate the proposed method in an experiment. Experiment results show that the proposed method offers a higher data imputation accuracy than the Iterative and Decision Tree methods.
Clustering method is a technique used for comparisons reduction between the candidates records in the duplicate detection process. The process of clustering records is affected by the quality of data. The more error-free the data, the more efficient the clustering algorithm, as data errors cause data to be placed in incorrect groups. Window algorithms suffer from the window size. The larger the window, the greater the number of unnecessary comparisons, and the smaller the window size may prevent the detection of duplicates that are supposed to be within the window. In this paper, we propose a data pre-processing method that increases the efficiency of window algorithms in grouping similar records together. In addition, the proposed method also deal s with the window size problem. In the proposed method, high-rank attributes are selected and then preparators are applied to the selected traits. A compensation algorithm is implemented to reduce the problem of missing and distorted sort keys. Two datasets (compact disc database (CDDB) and MusicBrainz) were used to test duplicates detection algorithms. The duplicates detection toolkit(DuDe) was used as a benchmark for the proposed method. Experiments showed that the proposed method achieved a high rate of accuracy in detecting duplicates. In addition, the proposed method.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.