In December 2019 a novel coronavirus (SARS-CoV-2), the cause of the disease COVID-19, emerged in the city of Wuhan, China. The World Health Organization declared the outbreak a global pandemic in March 2020. Since no adequate medical treatment for the disease had yet been discovered, many institutions worldwide committed to sharing the data they collect and process in their laboratories. A large amount of these data is shared with citizens in order to inform them about the risks posed by COVID-19. Credible international institutions such as the World Health Organization (WHO), Johns Hopkins University (JHU), and the European Centre for Disease Prevention and Control (ECDC) provide a variety of statistical data to address the issues raised by this emergent situation, but in some cases these reports cast doubt on the completeness and transparency of the data, which are not sufficiently processed and therefore create confusion about the risks we are facing. In this paper we study the quality of current global COVID-19 datasets from the most credible sources. We also compare datasets collected from the Republic of Kosovo and the Republic of North Macedonia with the corresponding data in the WHO, ECDC, and JHU datasets. To analyze the datasets from the different sources we use the Power BI tool, and we improve them by applying adequate data quality dimensions and data quality improvement methods.
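As a minimal illustration of the kind of data quality dimensions the abstract refers to, the sketch below computes a completeness and a consistency score for a national case series against a reference source. The paper itself performs this analysis in Power BI on the real WHO, ECDC, and JHU datasets; the column names and example values here are purely hypothetical.

```python
import pandas as pd

# Hypothetical schemas and values; the actual WHO/ECDC/JHU datasets use
# different layouts and are analysed with Power BI in the paper.
national = pd.DataFrame({
    "date": pd.date_range("2020-03-13", periods=5, freq="D"),
    "confirmed": [2, 2, None, 16, 19],   # one missing daily report
})
reference = pd.DataFrame({
    "date": pd.date_range("2020-03-13", periods=5, freq="D"),
    "confirmed": [2, 2, 9, 16, 19],
})

# Completeness dimension: share of non-missing values in the national dataset.
completeness = national["confirmed"].notna().mean()

# Consistency dimension: share of dates on which both sources report the same count.
merged = national.merge(reference, on="date", suffixes=("_national", "_reference"))
consistency = (merged["confirmed_national"] == merged["confirmed_reference"]).mean()

print(f"completeness = {completeness:.0%}, consistency = {consistency:.0%}")
```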
The quality of the data in core electronic registers has constantly decreased as a result of numerous errors and inconsistencies in these databases, caused by the growing number of databases created to provide electronic services for public administration and by the lack of data harmonization and interoperability between them. Evaluating and improving data quality by matching and linking records from multiple data sources becomes exceedingly difficult because of the very large volume of data spread across numerous sources with different data architectures and no unique field to interconnect them. Different algorithms have been developed to address these issues; our focus is on algorithms that handle large amounts of data, such as the Levenshtein distance (LV) algorithm and the Damerau-Levenshtein distance (DL) algorithm. To analyze and evaluate the effectiveness and quality of data using these algorithms, and to improve them, in this paper we conduct experiments on large datasets with more than 1 million records.
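For reference, the following is a minimal Python sketch of the two edit-distance measures the abstract names, in their standard dynamic-programming form. The paper's experiments run on datasets of over a million records and include the authors' own improvements to the algorithms; this sketch only illustrates the basic distance computation used for approximate record matching.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance: insertions, deletions, substitutions."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]


def damerau_levenshtein(a: str, b: str) -> int:
    """Damerau-Levenshtein (optimal string alignment) distance:
    like Levenshtein, but an adjacent transposition counts as one edit."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[m][n]


if __name__ == "__main__":
    # A misspelling with one transposed pair of letters:
    print(levenshtein("Pristina", "Prisitna"))          # 2 (two substitutions)
    print(damerau_levenshtein("Pristina", "Prisitna"))  # 1 (one transposition)
```

In record linkage, two field values are typically treated as a match when their distance falls below a threshold that depends on the field length; the transposition edit makes the DL variant more forgiving of common typing errors.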