In this paper, we present the design and implementation of a network throughput prediction and optimization service for many-task computing in widely distributed environments. This service uses multiple parallel TCP streams to improve the end-to-end throughput of data transfers. A novel mathematical model is used to decide the number of parallel streams needed to achieve the best performance. This model can predict the optimal number of parallel streams with as few as three prediction points. We implement this new service in the Stork data scheduler, where the prediction points can be obtained using Iperf and GridFTP sampling. Our results show that the prediction cost plus the optimized transfer time is much less than the unoptimized transfer time in most cases.
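As a hedged illustration of why three prediction points can suffice: if throughput as a function of stream count is modeled as Th(n) = n / sqrt(a·n² + b·n + c), the fit linearizes to n²/Th² = a·n² + b·n + c, an exact 3×3 linear system. The specific model form, the sample stream counts, and the function names below are assumptions for this sketch, not taken verbatim from the paper.

```python
def fit_model(samples):
    """Fit Th(n) = n / sqrt(a*n^2 + b*n + c) to three (n, Th) samples.

    Linearizing gives n^2 / Th^2 = a*n^2 + b*n + c, a 3x3 linear
    system in (a, b, c), solved here by Gaussian elimination.
    """
    rows = [[n * n, n, 1.0, n * n / (t * t)] for n, t in samples]
    # forward elimination
    for i in range(3):
        piv = rows[i][i]
        for j in range(i + 1, 3):
            f = rows[j][i] / piv
            rows[j] = [x - f * y for x, y in zip(rows[j], rows[i])]
    # back substitution
    c = rows[2][3] / rows[2][2]
    b = (rows[1][3] - rows[1][2] * c) / rows[1][1]
    a = (rows[0][3] - rows[0][1] * b - rows[0][2] * c) / rows[0][0]
    return a, b, c

def optimal_streams(a, b, c):
    """Setting d/dn of n^2 / (a*n^2 + b*n + c) to zero gives
    n* = -2c / b (valid when b < 0 and c > 0)."""
    return round(-2.0 * c / b)
```

Because the linearized system is exact, three distinct sample points pin down (a, b, c) completely; more samples would only be needed to average out measurement noise.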
The end-to-end performance of TCP over wide-area networks may be a major bottleneck for large-scale network-based applications. Two practical ways of increasing TCP performance at the application layer are using multiple parallel streams and tuning the buffer size. Tuning the buffer size can lead to a significant increase in application throughput. However, using multiple parallel streams generally gives better results than an optimized buffer size with a single stream. Parallel streams tend to recover from failures more quickly and are more likely to steal bandwidth from the other streams sharing the network. Moreover, our experiments show that proper use of a tuned buffer size together with parallel streams can increase the throughput beyond the cases where only tuned buffers or only parallel streams are used. In that sense, balancing a tuned buffer size against the number of parallel streams, and defining the optimal values for those parameters, is very important. In this paper, we analyze the results of different techniques for balancing the TCP buffer and parallel streams at the same time, and present the initial steps toward a balanced model of throughput based on these optimized parameters.
Data placement in complex scientific workflows has gradually attracted more attention, since the large amounts of data generated by these workflows significantly increase the turnaround time of the end-to-end application. It is almost impossible to produce an optimal schedule for the end-to-end workflow without considering the intermediate data movement. To reduce the complexity of the workflow-scheduling problem, most existing work constrains the problem space with unrealistic assumptions, which result in non-optimal schedules in practice. In this study, we propose a genetic, data-aware algorithm for the end-to-end workflow-scheduling problem, which performs very close to the optimal solution.
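To make the idea of a genetic, data-aware scheduler concrete, here is a deliberately toy sketch: a chromosome assigns each task to a site, and the fitness adds a transfer penalty whenever dependent tasks land on different sites. The chain-shaped dependency structure, cost tables, and all parameter values are assumptions for illustration only, not the paper's actual algorithm.

```python
import random

def evolve(n_tasks, n_sites, cost, transfer, generations=200, pop_size=30):
    """Toy genetic scheduler: chromosome = task-to-site assignment.

    cost[t][s]  : compute cost of task t on site s (assumed table)
    transfer[t] : data-movement penalty if task t and its predecessor
                  t-1 run on different sites (assumes a chain workflow)
    """
    def fitness(assign):
        total = sum(cost[t][assign[t]] for t in range(n_tasks))
        total += sum(transfer[t] for t in range(1, n_tasks)
                     if assign[t] != assign[t - 1])
        return total

    pop = [[random.randrange(n_sites) for _ in range(n_tasks)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                    # elitist selection
        survivors = pop[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, n_tasks)
            child = p1[:cut] + p2[cut:]          # one-point crossover
            m = random.randrange(n_tasks)        # point mutation
            child[m] = random.randrange(n_sites)
            children.append(child)
        pop = survivors + children
    best = min(pop, key=fitness)
    return best, fitness(best)
```

The data-aware aspect lives entirely in the fitness function: because transfer penalties are evaluated alongside compute costs, the search naturally co-locates tasks that exchange large intermediate datasets.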