Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing 2008
DOI: 10.1145/1383519.1383521

Accelerating large-scale data exploration through data diffusion

Abstract: Data-intensive applications often require exploratory analysis of large datasets. If analysis is performed on distributed resources, data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when dem…

Cited by 56 publications (54 citation statements)
References 18 publications
“…Combining compute and data management: What is even more critical is the combination of the compute and data resource management, which leverages data locality in access patterns to minimize the amount of data movement and improve end-application performance and scalability [24]. Attempting to address storage and computational problems separately forces much data movement between computational and storage resources, which will not scale to tomorrow's exascale datasets and millions of nodes, and will yield significant underutilization of the raw resources.…”
Section: E Data Management Challenge
Citation type: mentioning
confidence: 99%
“…gigabit Ethernet) as well as proprietary and more exotic networks (Torus, Tree, and Infiniband). [9,16] We believe that there is more to HPC than tightly coupled MPI, and more to HTC than embarrassingly parallel long running jobs. Like HPC applications, and science itself, applications are becoming increasingly complex opening new doors for many opportunities to apply HPC in new ways if we broaden our perspective.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…3) Keeping data size modest, but increasing the number of tasks moves us into the loosely coupled applications involving many tasks (yellow); Swift/Falkon [6,7] and Pegasus/DAGMan [8] are examples of this category. 4) Finally, the combination of both many tasks and large datasets moves us into the data-intensive many-task computing category (green); examples of this category are Swift/Falkon and data diffusion [9], Dryad [ Sawzall [11].…”
Section: Defining Many Task Computing
Citation type: mentioning
confidence: 99%
“…TABLE III To the best of our knowledge, HyCache is the first user-level POSIX-compliant hybrid caching for distributed file systems. Some of our previous work [15][16][17] proposed data caching to accelerate applications by modifying the applications and/or their workflow, rather than the at the filesystem level. Other existing work requires modifying OS kernel, or lacks of a systematic caching mechanism for manipulating files across multiple storage devices, or does not support the POSIX interface.…”
Section: Application
Citation type: mentioning
confidence: 99%