2019
DOI: 10.1109/tbdata.2017.2723473
A Survey on Geographically Distributed Big-Data Processing Using MapReduce

Abstract: Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many industries, e.g., Google, Facebook, and Amazon, to solve a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and social network analysis. However, all these popular systems have …

Cited by 56 publications (36 citation statements)
References 119 publications
“…The literature is huge, but mostly on distributed system architecture, computing platforms, or data query tools. For a review of recent developments, please see [13,21] and references therein.…”
Section: Related Work
confidence: 99%
“…Presently, many big data workloads operate across isolated data stores that are distributed geographically and manipulated by different clouds. For example, the typical scientific data processing pipeline [26,40] consists of multiple stages that are frequently conducted by different research organizations with varied computing demands. Accordingly, accelerating data analysis for each stage may require computing facilities that are located in different clouds.…”
Section: Introduction
confidence: 99%
“…"Data diffusion" [19,20], which can acquire compute and storage resources dynamically, replicate data in response to demand, and schedule computations close to data, has been proposed for Grid computing. In the cloud era, the idea has been extended to scientific workflows that schedule compute tasks [40] and move required data across the global deployment of cloud centers. Both data diffusion and cloud workflows rely on a centralized site that provides data-aware compute tasks scheduling and supports an index service to locate data sets dispersed globally.…”
Section: Introduction
confidence: 99%
“…Generally, the data center (DC)-based computing infrastructure serves as an effective platform for satisfying both the computational and data storage requirements of big data analytics. To meet increasing data analysis demands and provide reliability, service providers deploy their data analytics service globally on multiple geographically distributed DCs, referred to as Geographically-distributed Data Analytics (GDA) [2], [3]. The basic infrastructures for GDA generally consist of a massive number of servers and multiple Internet Data Centers (IDCs) in different locations.…”
Section: Introduction
confidence: 99%