Abstract: Data warehousing has been widely adopted in enterprises to organise historical data, generate regular reports, support decision making, analyse data and mine potentially valuable information. Its architecture can be divided into several layers, from operational databases to presentation interfaces.
With decades of development and innovation, data warehouses and their architectures have been extended into a variety of derivatives in various environments to meet the requirements of different organisations. Although there are some ad-hoc studies investigating and classifying data warehouse architectures (DWHAs), little research has systematically modelled and classified them. This gap is especially pronounced in the big data era, where data is generated explosively and emerging architectures and technologies are increasingly leveraged to manipulate and manage big data in this domain. It is therefore valuable to revisit and investigate DWHAs in light of these innovations. In this paper, we collect 116 publications and model 73 disparate DWHAs using ArchiMate; 9 representative DWHAs are then identified and summarised into a "big picture". Furthermore, we propose a new classification model aligned with state-of-the-art DWHAs. This model can guide researchers and practitioners in identifying, analysing and comparing differences and trends among DWHAs from both component and architectural perspectives.
Many data-driven organisations need to integrate data from multiple, distributed and heterogeneous sources for advanced data analysis. A data integration system is an essential component for collecting data into a data warehouse or other data analytics systems. Various data integration systems exist, either built in-house or provided by vendors, so an organisation needs to compare and benchmark candidates when choosing one that meets its requirements. Recently, TPC-DI was proposed as the first industrial benchmark for evaluating data integration systems. When using this benchmark, we found typical data quality problems in the TPC-DI data source, such as multi-meaning attributes and inconsistent data schemas, which could delay or even cause the failure of the data integration process. This paper explains the processes of this benchmark and summarises the typical data quality problems identified in the TPC-DI data source. Furthermore, to prevent data quality problems and manage data quality proactively, we propose a set of practical guidelines for researchers and practitioners to conduct data quality management when using the TPC-DI benchmark.
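One of the data quality problems mentioned above, inconsistent data schemas across source batches, can be detected mechanically before integration begins. The following is a minimal sketch of such a pre-flight check; the batch names, column names, and the renamed-attribute scenario are illustrative assumptions, not taken from the TPC-DI specification.

```python
def check_schema_consistency(batches):
    """Compare column sets of each source batch against the first batch.

    batches: list of (batch_name, column_list) pairs.
    Returns a list of (batch_name, missing_columns, extra_columns)
    for every batch whose schema deviates from the reference.
    """
    reference = set(batches[0][1])  # first batch defines the expected schema
    problems = []
    for name, columns in batches[1:]:
        missing = reference - set(columns)   # expected but absent
        extra = set(columns) - reference     # present but unexpected
        if missing or extra:
            problems.append((name, sorted(missing), sorted(extra)))
    return problems

# Hypothetical example: a later batch silently renames C_TAX_ID to TaxID,
# the kind of schema drift that can delay or fail an integration run.
batches = [
    ("CustomerMgmt_batch1", ["C_ID", "C_TAX_ID", "C_GNDR"]),
    ("CustomerMgmt_batch2", ["C_ID", "TaxID", "C_GNDR"]),
]
print(check_schema_consistency(batches))
# → [('CustomerMgmt_batch2', ['C_TAX_ID'], ['TaxID'])]
```

Running such a check on every incoming batch surfaces schema drift early, in line with the proactive data quality management the guidelines advocate.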