The development of data warehouses in cloud environments has been growing, yet data modeling in this setting still has no established standard. This article therefore presents a comparative performance analysis of the Hive platform under the snowflake model and the fully denormalized model. The data used for the analysis are the open data of the Brazilian Army, hosted in the Google Cloud environment. The analyses are carried out for different numbers of rows in Hive, for one cluster configuration scenario, and for two table storage formats. Finally, using the Parquet format for the tables yielded performance more than four times better than that of the CSV format.
With the emergence of Big Data and the continuous growth of massive data produced by web applications, smartphones, social networks, and other sources, organizations began to invest in alternative solutions that could derive value from this volume of data. In this context, this article evaluates three factors that can significantly influence the performance of Big Data Hive queries: data modeling, data format, and processing tool. The objective is to present a comparative analysis of Hive platform performance under the snowflake model and the fully denormalized model. The influence of two table storage formats (CSV and Parquet) and two data processing tools (Hadoop and Spark) was also comparatively analyzed. The data used for the analysis are the open data of the Brazilian Army, hosted in the Google Cloud environment. The analysis was performed for different data volumes in Hive and for different cluster configuration scenarios. The results show that the Parquet storage format always performed better than the CSV storage format, regardless of the model and processing tool selected for the test scenario.