Data locality is an important concept in big data processing. Most existing
research optimizes data locality through task scheduling. However, executors
are the containers in which tasks execute, so the nodes on which executors
are started directly affect the locality level that tasks can achieve. This
paper aims to improve data locality through executor allocation for the
reduce stage in the Spark computing environment. First, we
calculate the network distance matrix of executors and formulate an optimal
executor allocation problem to minimize the total communication distance.
Then, we propose an approximation algorithm for the case where the network
distances between executors satisfy the triangle inequality, and a greedy
algorithm for the case where they do not. Finally, we evaluate the
performance of our algorithms on a real Spark cluster using several
representative
micro-benchmarks (Sort and Join) and macro-benchmarks (PageRank and LDA).
Experimental results show that the proposed algorithms reduce the execution
time of tasks by lowering the amount of data communication.
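
To make the allocation problem concrete, the following is a minimal sketch and not the algorithm proposed in this paper, whose details are not given in the abstract. It assumes the problem can be read as choosing k executors out of n candidates so that the sum of pairwise network distances is small, and it uses a simple illustrative heuristic: seed with the closest pair, then repeatedly add the candidate nearest to the executors already chosen. The function names and the example distance matrix are hypothetical.

```python
from itertools import combinations


def satisfies_triangle_inequality(dist):
    """Return True if dist[i][j] <= dist[i][k] + dist[k][j] for all i, j, k."""
    n = len(dist)
    return all(
        dist[i][j] <= dist[i][k] + dist[k][j]
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )


def total_distance(dist, chosen):
    """Sum of pairwise distances among the chosen executors."""
    return sum(dist[a][b] for a, b in combinations(chosen, 2))


def greedy_allocation(dist, k):
    """Pick k executors with a nearest-to-selection heuristic (illustrative).

    Assumption: dist is a symmetric n x n matrix of network distances
    between candidate executors. The result is not guaranteed optimal.
    """
    n = len(dist)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of candidates")
    # Seed the selection with the closest pair of candidates.
    first, second = min(
        combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]]
    )
    chosen = [first, second]
    while len(chosen) < k:
        remaining = [c for c in range(n) if c not in chosen]
        # Add the candidate with the smallest total distance to the selection.
        best = min(remaining, key=lambda c: sum(dist[c][s] for s in chosen))
        chosen.append(best)
    return chosen


# Hypothetical distances: executors 0 and 1 are close, 2 is farther, 3 is farthest.
example = [
    [0, 2, 4, 6],
    [2, 0, 4, 6],
    [4, 4, 0, 6],
    [6, 6, 6, 0],
]
selected = greedy_allocation(example, 3)
print(selected, total_distance(example, selected))
```

The triangle-inequality check only indicates which of the two cases described above a given distance matrix falls into; the heuristic itself runs in either case.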