Data quality and data cleaning are context-dependent activities. Starting from this observation, previous work proposed a context model for assessing the quality of a database instance. In that framework, the context takes the form of a possibly virtual database or data integration system into which the database instance under quality assessment is mapped for additional analysis and processing, enabling quality assessment. In this work we extend contexts with dimensions, which makes a multidimensional assessment of data quality possible. Multidimensional contexts are represented as ontologies written in Datalog±. We use this language to represent dimensional constraints and dimensional rules, and also to answer queries based on dimensional navigation, which becomes an important auxiliary activity in the assessment of data. We show ideas and mechanisms by means of examples.
[Table I. Measurements (columns: Time, Patient, Value)]
In multidimensional (MD) databases and data warehouses we commonly prefer instances that have summarizable dimensions, because they have good properties for query answering. Most typically, with summarizable dimensions, precomputed and materialized aggregate query results at lower levels of the dimension hierarchy can be used to correctly compute results at higher levels of the same hierarchy, improving efficiency. Although summarizability is such a desirable property, we argue that some established MD models cannot properly model the summarizability condition, and that this is a consequence of the limited expressive power of their modeling languages. We propose an extension of the Hurtado-Mendelzon (HM) MD model with subcategories, the EHM model, and show that it allows summarizability to be captured. We propose an efficient algorithm that, for a given cube view (i.e. an MD aggregate query) in an EHM database, determines from which minimal subset of precomputed cube views it can be correctly computed. Finally, we show how the EHM model can be implemented with minor modifications to the familiar ROLAP schemas.
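The reuse of precomputed aggregates described in this abstract can be illustrated with a small Python sketch. It is not taken from the paper: the category names (city, state, country), the element names, the sales figures, and the helper `rollup` are all hypothetical, and the aggregate is assumed to be a sum. The sketch compares a country total computed from a precomputed state-level view with the total computed directly from the base data, first for a dimension where every city has exactly one state and then for one where a city has no state.

```python
from collections import defaultdict

def rollup(values, mapping):
    """Sum values keyed by child element into totals keyed by parent element."""
    totals = defaultdict(int)
    for child, value in values.items():
        for parent in mapping.get(child, ()):
            totals[parent] += value
    return dict(totals)

# Hypothetical base data: sales per city.
city_sales = {"c1": 10, "c2": 20, "c3": 5}
state_to_country = {"s1": "k1", "s2": "k1"}
state_to_country = {s: [k] for s, k in state_to_country.items()}
city_to_country = {"c1": ["k1"], "c2": ["k1"], "c3": ["k1"]}

# Case 1: every city rolls up to exactly one state.
good_city_to_state = {"c1": ["s1"], "c2": ["s1"], "c3": ["s2"]}
good_state_view = rollup(city_sales, good_city_to_state)    # precomputed view
good_via_view = rollup(good_state_view, state_to_country)   # reuse the view
good_direct = rollup(city_sales, city_to_country)           # from base data

# Case 2: city c3 has no state but still belongs to country k1,
# so the state-level view silently loses its sales.
bad_city_to_state = {"c1": ["s1"], "c2": ["s1"]}
bad_state_view = rollup(city_sales, bad_city_to_state)
bad_via_view = rollup(bad_state_view, state_to_country)
bad_direct = rollup(city_sales, city_to_country)

if __name__ == "__main__":
    print("case 1:", good_via_view, good_direct)
    print("case 2:", bad_via_view, bad_direct)
```

In the first case the two totals agree, so the state-level view can safely be reused. In the second case the reused view undercounts, which is the kind of incorrect result that a summarizability condition is meant to rule out.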
Summarizability in a multidimensional (MD) database refers to the correct reusability of pre-computed aggregate queries (or views) when computing higher-level aggregations or rollups. A dimension instance has this property if and only if it is strict and homogeneous. A dimension instance that fails to satisfy either of these two semantic conditions has to be repaired, restoring strictness and homogeneity. In this work, we take a relational approach to the problem of repairing dimension instances. A dimension repair is obtained by translating the dimension instance into a relational instance, repairing the latter using established techniques in the relational framework, and properly inverting the process. We show that the common relational star and snowflake schemas for MD databases are not the best choice for this process. Instead, we propose and formalize the path relational schema, which becomes the basis for obtaining dimensional repairs. The path schema turns out to have useful properties in general, as a basis for a relational representation and implementation of MD databases and data warehouses. It is also particularly suitable for restoring MD summarizability through relational repairs. We compare the dimension repairs so obtained with existing repair approaches for MD databases.
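The two conditions named in this abstract can be checked on a toy dimension with a short Python sketch. It assumes the usual reading of the terms: a dimension instance is strict if every element rolls up to at most one element in each higher category, and homogeneous if every element has a parent in each parent category of its own category. The schema (Ward, Unit, Hospital), the element names, and the function names are hypothetical, and the implicit top category is omitted for brevity. This is not the paper's path-schema repair procedure, only a check of the conditions it aims to restore.

```python
from collections import defaultdict

# Hypothetical schema: each category lists its parent categories.
SCHEMA = {"Ward": ["Unit"], "Unit": ["Hospital"], "Hospital": []}
MEMBERS = {"Ward": ["W1", "W2", "W3"],
           "Unit": ["Standard", "Intensive"],
           "Hospital": ["H1"]}

def category_of(members):
    """Map each element to the category it belongs to."""
    return {e: c for c, elems in members.items() for e in elems}

def ancestors(element, links):
    """All elements reachable from `element` through child -> parent links."""
    seen, stack = set(), [element]
    while stack:
        for parent in links.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def is_strict(members, links):
    """True if no element reaches two different elements of one category."""
    cat = category_of(members)
    for element in cat:
        reached = defaultdict(set)
        for a in ancestors(element, links):
            reached[cat[a]].add(a)
        if any(len(elems) > 1 for elems in reached.values()):
            return False
    return True

def is_homogeneous(schema, members, links):
    """True if every element has a parent in each parent category."""
    cat = category_of(members)
    for element, c in cat.items():
        parent_cats = {cat[p] for p in links.get(element, ())}
        if not set(schema[c]) <= parent_cats:
            return False
    return True

GOOD = {"W1": ["Standard"], "W2": ["Standard"], "W3": ["Intensive"],
        "Standard": ["H1"], "Intensive": ["H1"]}
# W1 rolls up to two units: violates strictness.
NON_STRICT = dict(GOOD, W1=["Standard", "Intensive"])
# W3 has no unit: violates homogeneity.
NON_HOMOGENEOUS = dict(GOOD, W3=[])

if __name__ == "__main__":
    for name, links in [("good", GOOD), ("non-strict", NON_STRICT),
                        ("non-homogeneous", NON_HOMOGENEOUS)]:
        print(name, is_strict(MEMBERS, links),
              is_homogeneous(SCHEMA, MEMBERS, links))
```

A repair in the sense of the abstract would change the links (or, in other approaches, the schema) until both checks pass; the sketch only detects the violations.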
In multidimensional (MD) databases, summarizability is a key property for obtaining interactive response times. With summarizable dimensions, pre-computed and materialized aggregate query results at lower levels of the dimension hierarchy can be used to correctly compute results at higher levels of the same hierarchy, improving efficiency. Although summarizability is such a desirable property, we argue that established MD models cannot properly model the summarizability condition, and that this is a consequence of the limited expressive power of their modeling languages. In addition, because of limitations in existing MD models, algorithms for deciding summarizability and for cube view selection are not efficient or practical. We propose an extension of the Hurtado-Mendelzon (HM) MD model, the EHM model, that includes subcategories, and we explore its properties, especially in addressing issues related to summarizability. We investigate the extended model as a way to directly model MD databases, with some clear advantages over HM models. Most importantly, EHM is, in a precise technical sense, more expressive than HM for modeling MD databases that are subject to summarizability conditions. Moreover, given an MD aggregate query in an EHM database, we can determine in a practical way (one that only requires processing the dimension schema, as opposed to the instance) from which minimal subset of pre-computed cube views it can be correctly computed. Our extended model allows for a repair approach that transforms non-summarizable HM dimensions into summarizable EHM dimensions. We propose and formalize a two-step process that involves modifying both the schema and the instance of a non-summarizable HM dimension.