Computer clusters with a shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies for speeding up the computation of big data and increasing scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed, including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed, with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data; on the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
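Of the classical sampling methods mentioned above, reservoir sampling is the one that works without knowing the data size in advance, which is what makes it attractive for streams and large distributed files. A minimal single-pass sketch (the function name and signature are illustrative, not from the survey) of the standard Algorithm R:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Algorithm R: draw a uniform random sample of k items from a
    stream of unknown length in one pass, using O(k) memory."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)        # uniform index in [0, i], inclusive
            if j < k:
                reservoir[j] = item      # item i kept with probability k/(i+1)
    return reservoir
```

Each item ends up in the sample with equal probability k/n, so the result is a simple random sample even though n is never known up front.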
The rapid development of IoT-based services has resulted in an exponential increase in the number of connected smart mobile devices (SMDs). Processing the massive data generated by the large number of SMDs is becoming a major problem for mobile devices, servers, and wireless communication channels. The Multi-access Edge Computing (MEC) paradigm partially mitigates this problem by deploying edge server nodes at the edge of wireless networks near SMDs, but the challenge remains due to the limited computation capacity of MEC servers and the bandwidth of wireless channels. In addition, the dependency among tasks generated by applications on SMDs increases the complexity of the problem. In this paper, we propose a constrained multiobjective computation offloading optimization solution to resolve the problem of task dependency under limited resources. This solution improves the Quality of Service (QoS) by minimizing the latency, energy consumption, and rate of task failure caused by limited resources. We propose a two-stage hybrid computation offloading optimization method to solve the problem. In the first stage, the computation offloading decisions are made based on the preferences of tasks. Then, in the second stage, global optimal solutions are found using the modified Non-Dominated Sorting Genetic Algorithm (NSGA-III). The overall efficiency of the proposed method is increased because the preference-based algorithm reinforces the NSGA-III algorithm by generating a better initial population. The results of extensive experiments show that the efficiency of the proposed method is significantly better than that of existing methods.
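At the core of NSGA-III's non-dominated sorting is a Pareto-dominance test over the objective vectors, here latency, energy, and failure rate, all minimized. A minimal sketch of that test and of extracting the first (non-dominated) front; the function names are ours, and this is only the ranking primitive, not the full NSGA-III with reference-point niching:

```python
def dominates(a, b):
    """True if solution a Pareto-dominates b (all objectives minimized):
    a is no worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def non_dominated_front(points):
    """Return the points not dominated by any other point (the first front)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

NSGA-III repeatedly peels off such fronts to rank a population; seeding that population with preference-based offloading decisions, as the paper proposes, gives the sorting a better starting set of candidates.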