New-generation applications in the big data and high-performance computing (HPC) areas demand very diverse operational properties, and their convergence requires dynamic behavior from system components. Load balancing is a critical issue given the highly unpredictable, dynamic, and data-oriented behavior of such systems. Practical constraints, such as communication and load-transfer delays, play an essential role in the design of a dynamic load balancer. Moreover, given the distributed nature of most new platforms, the load balancer should be able to operate in a fully distributed manner. In this research, we consider practical issues, including heterogeneous processing power, storage capability, and communication and load-transfer delays, and propose two distributed, optimized load-balancing methods in HPC for big data processing. We model these constraints and introduce a parameter, named the compensating factor, for the optimized load balancer. We aim to minimize task execution time by reducing the nodes' idle time. We evaluate the proposed methods in different scenarios using Monte Carlo simulation. Evaluation results show that the proposed methods decrease idle time significantly while remaining scalable with network size and applicable to heterogeneous networks with dynamic resources and configurations.
KEYWORDS
big data, distributed computing, high-performance computing, load balancing, optimization
INTRODUCTION

At present, a large amount of data is being generated exponentially by the massive number of sensors, Internet of Things (IoT) devices, and other connected devices. Various big data sources, such as social media, black-box data, stock exchange data, power grid data, transport data, and search engine data, require a huge amount of processing power in almost real-time scenarios. Data and information on this scale must be managed by runtime tools. High-performance computing (HPC) systems are practical solutions for vast and complex processing. However, traditional HPC tools are not adequate, and runtime tools are needed for big data processing on HPC platforms.[1]

Big data refers to the emerging technologies designed to extract value from data having at least three Vs, namely volume, variety, and velocity. We can say that "big data" is a collection of information of ever-growing volume that can be structured, semi-structured, unstructured, and time-stamped. Statistical and regression techniques may be used to analyze data at this scale.[2]

One of the most important and challenging tools in HPC platforms is the load balancer. The load balancer divides the processing load among the available computing systems so that the processing work is completed in the minimum time, subject to the given constraints. In load balancing, the