Introduction: rapidly growing volumes of information pose new challenges to modern data analysis technologies. Currently, based on cost and performance considerations, data processing is usually performed in cluster systems. One of the most common related operations in analytics is the joins of datasets. Join is an extremely expensive operation that is difficult to scale and increase efficiency in distributed databases or systems based on the MapReduce paradigm. Despite the fact that a lot of effort has been put into improving the performance of this operation, often the proposed methods either require fundamental changes in the MapReduce structure, or are aimed at reducing the overhead of the operation, such as balancing the load on the network. Objective: to develop an algorithm to accelerate the integration of data sets in distributed systems. Results: a review of the Apache Spark architecture and the features of distributed computing based on MapReduce is performed, typical methods for combining datasets are analyzed, the main recommendations for optimizing the operation of combining data are presented, an algorithm that allows you to speed up the special case of combining implemented in Apache Spark is presented. This algorithm uses the methods of partitioning and partial transfer of sets to the computing nodes of the cluster, in such a way as to take advantage of the merge and broadcast associations. The experimental data presented demonstrate that the method is all the more effective the larger the volume of input data. So, for 2Tb compressed data, acceleration up to ~37% was obtained in comparison with standard Spark SQL.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.