Spark is a memory-based distributed data processing framework. During the shuffle phase, large volumes of data are transmitted over the network, which is the main bottleneck of Spark. Because partitions are distributed unevenly across nodes, the input to Reduce tasks is unbalanced. To address this problem, a partition policy based on task locality is designed to balance task input. Experiments verify that the optimization mechanism alleviates data skew and improves job execution efficiency.
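The abstract does not give the paper's exact policy, but the idea of balancing Reduce-side input can be illustrated with a custom Spark Partitioner. The following is a minimal sketch, assuming heavy keys have been pre-sampled into a key-frequency map (`keyWeights`); the class name, the greedy least-loaded assignment, and the hash fallback are illustrative assumptions, not the authors' algorithm.

```scala
import org.apache.spark.Partitioner

// Hypothetical size-aware partitioner: sampled heavy keys are greedily
// assigned to the currently lightest partition so that total weight per
// partition stays roughly equal; unseen keys fall back to hash partitioning.
class BalancedPartitioner(numParts: Int, keyWeights: Map[String, Long])
    extends Partitioner {

  require(numParts > 0)

  // Pre-compute an assignment for the sampled keys, heaviest first.
  private val assignment: Map[String, Int] = {
    val load = Array.fill(numParts)(0L)
    keyWeights.toSeq.sortBy(-_._2).map { case (key, weight) =>
      val target = load.indexOf(load.min)   // lightest partition so far
      load(target) += weight
      key -> target
    }.toMap
  }

  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int =
    assignment.getOrElse(key.toString,
      ((key.hashCode % numParts) + numParts) % numParts)
}
```

In use, the frequency map could come from a small sample, e.g. `pairRdd.sample(false, 0.01).countByKey()`, and the skewed RDD would then be repartitioned with `pairRdd.partitionBy(new BalancedPartitioner(numParts, sampledWeights))` before the Reduce-side stage.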
Abstract. Spark is a big data processing platform based on in-memory computing. Spark's default serialization strategy makes poor use of the cache, which greatly affects the efficiency of Spark task execution. To solve the problem of low computational efficiency caused by insufficient memory, this paper proposes an optimized serialized storage strategy that combines the recomputation cost of an RDD, its execution time, and the number of actions that use it. Experimental results show that the proposed strategy can improve computational efficiency under limited task resources.
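A weight-based caching decision of this kind can be sketched as follows. This is only an illustration under stated assumptions: the weighting formula, the threshold, and the choice of `MEMORY_ONLY_SER` are hypothetical and stand in for whatever combination of recomputation cost, execution time, and action count the paper actually uses.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical policy: persist an RDD in serialized form only when the
// estimated benefit (expensive to recompute, reused by many actions)
// outweighs the serialization overhead.
object SerializedCachePolicy {

  // Relative benefit of caching: higher when recomputation is costly
  // compared to one execution and the RDD is reused by several actions.
  def cacheWeight(recomputeCostMs: Double,
                  execTimeMs: Double,
                  actionCount: Int): Double =
    (recomputeCostMs / math.max(execTimeMs, 1.0)) * actionCount

  // Apply the serialized storage level when the weight clears a threshold.
  def persistIfWorthwhile[T](rdd: RDD[T],
                             recomputeCostMs: Double,
                             execTimeMs: Double,
                             actionCount: Int,
                             threshold: Double = 2.0): RDD[T] = {
    if (cacheWeight(recomputeCostMs, execTimeMs, actionCount) >= threshold)
      rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized to save cache space
    else
      rdd
  }
}
```

Serialized storage levels trade CPU (deserialization on access) for memory footprint, which is the trade-off the abstract's strategy targets when task memory is limited.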