With the explosive growth of data volume, data centers play a critical role in storing and processing huge amounts of data. A traditional single data center can no longer keep pace with such incredibly fast-growing data. Recently, some studies have extended data processing tasks to geographically distributed data centers. However, because task placement and data transfer must be considered jointly, it is complex and difficult to design a proper scheduling approach that minimizes makespan under constraints such as task dependencies, processing capability, and network capacity. Therefore, we propose JHTD, an efficient joint scheduling framework based on hypergraphs for task placement and data transfer across geographically distributed data centers. JHTD consists of two crucial stages. In the first stage, since hypergraphs excel at modeling complex problems, we leverage a hypergraph-based model to establish the relationships among tasks, data files, and data centers, and then develop a hypergraph-based partitioning method for task placement. In the second stage, a task reallocation scheme is devised according to each task-to-data dependency, and a data-dependency-aware transfer scheme is designed to minimize the makespan. Finally, we conduct a variety of simulation experiments based on the real-world China-VO project. The results demonstrate that JHTD effectively optimizes task placement and data transfer across geographically distributed data centers. Compared with three other state-of-the-art algorithms, JHTD reduces the makespan by up to 20.6%. We also examine further metrics (data transfer volume and load balancing) to show and discuss the effectiveness of JHTD.

INDEX TERMS Big data processing, geographically distributed data centers, joint scheduling framework, hypergraph, task placement, data transfer.
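To make the hypergraph view concrete before the formal presentation, the following is a minimal illustrative sketch (our illustration only, not JHTD's actual implementation): each data file induces a hyperedge over the tasks that read it, and the data-transfer pressure of a candidate placement can be estimated with the standard connectivity (lambda minus 1) cut used in hypergraph partitioning, where a hyperedge spanning k data centers implies roughly k - 1 file transfers. All task, file, and data-center names below are hypothetical.

```python
from collections import defaultdict

# Toy workload: each task and the data files it reads (hypothetical names).
task_reads = {
    "t1": {"f1", "f2"},
    "t2": {"f2"},
    "t3": {"f1", "f3"},
    "t4": {"f3"},
}

# Each data file induces a hyperedge over all tasks that depend on it:
# co-locating a hyperedge's tasks means the file need not cross sites.
hyperedges = defaultdict(set)
for task, files in task_reads.items():
    for f in files:
        hyperedges[f].add(task)

# A hypothetical placement of tasks onto two data centers.
placement = {"t1": "dc1", "t2": "dc1", "t3": "dc2", "t4": "dc2"}

def transfer_cost(hyperedges, placement):
    """Connectivity-style cut: a hyperedge spanning k data centers
    forces its file to reach k sites, i.e. about (k - 1) transfers."""
    cost = 0
    for tasks in hyperedges.values():
        sites = {placement[t] for t in tasks}
        cost += len(sites) - 1
    return cost

print(transfer_cost(hyperedges, placement))  # 1: only f1 spans both sites
```

Minimizing this cut while balancing the per-site task load is exactly the objective a hypergraph partitioner optimizes, which is why such a model suits the task-placement stage.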
I. INTRODUCTION
With the advent of the Big Data era, the rate of data generation is increasing dramatically. For example, Internet giants such as Google and Facebook crunch more than 10 PB of data a day [1]. As a result, it is essential to improve the efficiency of data processing in the face of this huge amount of data. MapReduce [2] and Spark [3] have been widely adopted to process large amounts of data. These frameworks usually execute data-analytic jobs characterized by data-dependency awareness. Such jobs can be divided into a set of dependent tasks, where the execution of a task requires not only the outputs of its parent tasks but also the data. Normally, the data and