Simulated annealing (SA) is an effective method for solving unconstrained optimization problems and has been widely used in machine learning and neural networks. Nowadays, to optimize complex problems involving big data, the SA algorithm has been implemented on big data platforms and achieves a certain speedup. However, the efficiency of such implementations is still limited because the conventional SA algorithm runs with low parallelism on these new platforms, leaving computing resources underutilized. To address these problems, this paper proposes a speculative parallel SA algorithm based on Apache Spark that expands the algorithm's parallelism and enhances its efficiency. First, the inner dependencies that prevent the conventional algorithm from running in parallel are analyzed. Then, based on this analysis, the Software Thread-Level Speculation technique is employed to overcome these dependencies so that the algorithm can run concurrently. Finally, a new parallel SA algorithm with a speculation mechanism is proposed and implemented on Apache Spark. Experiments show that, for big data problems, the proposed algorithm achieves higher parallelism than the traditional algorithm without speculation on Apache Spark, and that it markedly enhances the execution efficiency of the simulated annealing process.
KEYWORDS
Apache Spark, parallel computing, simulated annealing, thread-level speculation
INTRODUCTION
Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic for approximating global optimization in a large search space [1]. However, as data grow larger, the traditional SA algorithm can barely meet present data-processing speed requirements. This is because, for one thing, the traditional method is inefficient and poorly parallelized; for another, the computing power that present platforms can afford is not enough to support large-scale processing.

To solve this problem, parallel programming is frequently used to accelerate the conventional SA algorithm. Researchers have successfully implemented parallelized SA algorithms on multicore platforms and achieved a certain speedup. Moreover, a good approach to processing large-scale data is to exploit the enormous computing power offered by modern distributed computing platforms (also called big data platforms) such as Apache Hadoop [2] or Apache Spark [3]. These popular platforms adopt MapReduce, a specialization of the split-apply-combine strategy for data analysis, as their programming model. A standard MapReduce program is composed of pairs of Map operations (which sort or filter the data) and Reduce operations (which summarize the results produced by the Map operations), and the high parallelism of a MapReduce model is obtained by marshalling these operation pairs and executing them on distributed servers in parallel. The platforms that adopt the MapReduce model often split the input, turn the large-scale prob...
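To make the annealing procedure concrete before discussing its parallelization, the following minimal Python sketch (not taken from this paper; the geometric cooling schedule, the parameter values, and the neighbor function are illustrative assumptions) shows the accept/reject loop that defines SA:

```python
import math
import random

def simulated_annealing(cost, neighbor, x0, t0=1.0, t_min=1e-4, alpha=0.95):
    """Minimize `cost` starting from x0. All parameters here are illustrative."""
    x, t = x0, t0
    best, best_cost = x0, cost(x0)
    while t > t_min:
        candidate = neighbor(x)
        delta = cost(candidate) - cost(x)
        # Always accept downhill moves; accept uphill moves
        # with probability exp(-delta / t), which shrinks as t cools.
        if delta < 0 or random.random() < math.exp(-delta / t):
            x = candidate
            if cost(x) < best_cost:
                best, best_cost = x, cost(x)
        t *= alpha  # geometric cooling schedule (assumed for this sketch)
    return best, best_cost

# Toy usage: minimize f(x) = (x - 3)^2 over the reals.
result = simulated_annealing(
    cost=lambda x: (x - 3) ** 2,
    neighbor=lambda x: x + random.uniform(-0.5, 0.5),
    x0=0.0,
)
```

Note the loop-carried dependency: each iteration's candidate is generated from the state accepted in the previous iteration, which is precisely what makes the conventional algorithm hard to parallelize.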
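As an illustration of the Map/Reduce operation pairing described above (a minimal sketch, not this paper's implementation; the word-count task and all names are assumptions), a PySpark job chains Map-side and Reduce-side operations like so:

```python
from pyspark import SparkContext

sc = SparkContext(appName="MapReduceSketch")

# The input is split into partitions that are processed on distributed workers.
lines = sc.parallelize(["spark makes mapreduce simple", "spark runs in parallel"])

counts = (lines
          .flatMap(lambda line: line.split())   # Map: emit one record per word
          .map(lambda word: (word, 1))          # Map: key each word with a count of 1
          .reduceByKey(lambda a, b: a + b))     # Reduce: sum the counts per key

print(counts.collect())
sc.stop()
```

The Map-side operations run independently on each partition, so parallelism comes for free there; it is the Reduce-side combination (and, for SA, the sequential accept/reject chain) that constrains how much of the work can proceed concurrently.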