Toni Cortes scite author profile

PyCOMPSs: Parallel computational workflows in Python

The use of the Python programming language for scientific computing has been gaining momentum in recent years. The fact that it is compact and readable and its complete set of scientific libraries are two important characteristics that favour its adoption. Nevertheless, Python still lacks a solution for easily parallelising generic scripts on distributed infrastructures, since the current alternatives mostly require the use of APIs for message passing or are restricted to embarrassingly-parallel computations. In that sense, this paper presents PyCOMPSs, a framework that facilitates the development of parallel computational workflows in Python. In this approach, the user programs her script in a sequential fashion and decorates the functions to be run as asynchronous parallel tasks. A runtime system is in charge of exploiting the inherent concurrency of the script, detecting the data dependencies between tasks and spawning them to the available resources. Furthermore, we show how this programming model can be built on top of a big data storage architecture, where the data stored in the backend is abstracted and
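The decorate-and-run-asynchronously idea in this abstract can be illustrated with a minimal sketch built only on the Python standard library. This is not the PyCOMPSs API: the `task` decorator, the `wait_on` helper, and the thread pool below are hypothetical stand-ins, and the "runtime" here runs on local threads rather than distributed resources. It only shows how a sequential-looking script can yield asynchronous tasks whose data dependencies are resolved automatically.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps

# Hypothetical stand-in for a runtime: a small local thread pool.
_pool = ThreadPoolExecutor(max_workers=4)


def _resolve(value):
    """Return the concrete value, waiting if it is still a pending task result."""
    return value.result() if isinstance(value, Future) else value


def task(func):
    """Hypothetical decorator: calling the function submits an asynchronous task."""
    @wraps(func)
    def submit(*args, **kwargs):
        def run():
            # Data dependencies: any argument that is a Future is the output
            # of an earlier task, so wait for it before running this one.
            real_args = [_resolve(a) for a in args]
            real_kwargs = {k: _resolve(v) for k, v in kwargs.items()}
            return func(*real_args, **real_kwargs)
        return _pool.submit(run)
    return submit


def wait_on(value):
    """Hypothetical synchronisation point: block until the result is available."""
    return _resolve(value)


@task
def square(x):
    return x * x


@task
def add(a, b):
    return a + b


# The script reads sequentially, but every call returns immediately.
partials = [square(i) for i in range(4)]   # four independent tasks
total = partials[0]
for p in partials[1:]:
    total = add(total, p)                  # each add depends on earlier results
result = wait_on(total)                    # 0 + 1 + 4 + 9
```

The `square` calls have no dependencies on each other and can run concurrently, while each `add` waits only for the two results it consumes.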

Dip: A parallel program development environment

Labarta, Girona, Pillet et al. 1996

133

This paper describes an environment whose aim is to aid in the development and tuning of message passing applications before actually running them in a real system with a large number of processors. Our objective is not to eliminate tests on real machines but to be able to focus them in a more selective way and thereby minimize their number. The environment presented in this paper consists of three closely integrated tools: an instrumented communication library, a trace-driven simulator (Dimemas) and a visualization/analysis tool (Paraver).

The XtreemFS architecture—a case for object‐based file systems in Grids

Hupfeld, Cortes, Kolbeck et al. 2008

Concurrency and Computation

In today's Grids, files are usually managed by Grid data management systems that are superimposed on existing file and storage systems. In this paper, we analyze this predominant approach and argue that object-based file systems can be an alternative when adapted to the characteristics of a Grid environment. We describe how we are solving the challenge of extending the object-based storage architecture for the Grid in XtreemFS, an object-based file system for federated infrastructures.

A study on data deduplication in HPC storage systems

Meister, Kaiser, Brinkmann et al. 2012

Deduplication is a storage saving technique that is highly successful in enterprise backup environments. On a file system, a single data block might be stored multiple times across different files; for example, multiple versions of a file might exist that are mostly identical. With deduplication, this data replication is localized and redundancy is removed: by storing data just once, all files that use identical regions refer to the same unique data. The most common approach splits file data into chunks and calculates a cryptographic fingerprint for each chunk. By checking if the fingerprint has already been stored, a chunk is classified as redundant or unique. Only unique chunks are stored. This paper presents the first study on the potential of data deduplication in HPC centers, which belong to the most demanding storage producers. We have quantitatively assessed this potential for capacity reduction for 4 data centers (BSC, DKRZ, RENCI, RWTH). In contrast to previous deduplication studies focusing mostly on backup data, we have analyzed over one PB (1212 TB) of online file system data. The evaluation shows that typically 20% to 30% of this online data can be removed by applying data deduplication techniques, peaking up to 70% for some data sets. This reduction can only be achieved by a subfile deduplication approach, while approaches based on whole-file comparisons only lead to small capacity savings.
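The chunk-and-fingerprint approach described in this abstract can be sketched in a few lines of Python. The sketch makes assumptions the abstract does not state: fixed-size 4096-byte chunks, SHA-256 as the cryptographic fingerprint, and in-memory byte strings standing in for files. The function name `dedup_stats` and the sample data are hypothetical and serve only as illustration.

```python
import hashlib


def dedup_stats(files, chunk_size=4096):
    """Return (total_bytes, unique_bytes) for an iterable of byte strings.

    Assumptions for this sketch: fixed-size chunks and SHA-256 fingerprints.
    """
    seen = set()          # fingerprints of chunks already "stored"
    total = 0
    unique = 0
    for data in files:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            total += len(chunk)
            fingerprint = hashlib.sha256(chunk).digest()
            if fingerprint not in seen:
                # First time this content is seen: the chunk is unique.
                seen.add(fingerprint)
                unique += len(chunk)
            # Otherwise the chunk is redundant and would not be stored again.
    return total, unique


# Two hypothetical "versions" of a file that differ only in their last chunk.
version_1 = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
version_2 = b"A" * 4096 + b"B" * 4096 + b"D" * 4096

total, unique = dedup_stats([version_1, version_2])
saved_fraction = 1 - unique / total
```

In this toy input, four of the six chunks are unique, so one third of the data is redundant. A whole-file comparison would treat the two versions as distinct and save nothing, which mirrors the abstract's point about subfile deduplication.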
