Amir Behrouzi-Far scite author profile

2018

We study the expected completion time of some recently proposed algorithms for distributed computing that redundantly assign computing tasks to multiple machines in order to tolerate a certain number of machine failures. We show analytically that not only the amount of redundancy but also the task-to-machine assignment affects the latency of a distributed system. We study systems with a fixed number of computing tasks that are split into possibly overlapping batches, and with independent, exponentially distributed machine service times. We show that, for such systems, the uniform replication of non-overlapping (disjoint) batches of computing tasks achieves the minimum expected computing time.
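The comparison between disjoint and overlapping batches can be illustrated with a small Monte Carlo sketch. It is not the authors' analysis. It assumes n machines and n tasks, a batch size b that divides n, and that each machine finishes its whole batch after an independent Exp(rate) time. The cyclic layout and the names `disjoint_hosts`, `cyclic_hosts`, and `simulate` are hypothetical choices made for illustration.

```python
import random

def completion_time(times, hosts_per_task):
    # A task is done when its fastest host finishes; the job needs every task.
    return max(min(times[m] for m in hosts) for hosts in hosts_per_task)

def disjoint_hosts(n, b):
    # Assumes b divides n. Task j belongs to batch j // b, which is
    # replicated on the b machines of that block.
    return [list(range((j // b) * b, (j // b) * b + b)) for j in range(n)]

def cyclic_hosts(n, b):
    # Overlapping layout (assumed): machine i holds tasks i..i+b-1 (mod n),
    # so task j is hosted by machines j-b+1..j (mod n).
    return [[(j - k) % n for k in range(b)] for j in range(n)]

def simulate(n=6, b=2, rate=1.0, trials=20000, seed=0):
    rng = random.Random(seed)
    dis, cyc = disjoint_hosts(n, b), cyclic_hosts(n, b)
    total_dis = total_cyc = 0.0
    for _ in range(trials):
        # Same service-time sample is used for both layouts.
        times = [rng.expovariate(rate) for _ in range(n)]
        total_dis += completion_time(times, dis)
        total_cyc += completion_time(times, cyc)
    return total_dis / trials, total_cyc / trials

if __name__ == "__main__":
    mean_dis, mean_cyc = simulate()
    print("disjoint batches:", round(mean_dis, 3))
    print("cyclic batches:  ", round(mean_cyc, 3))
```

Under these assumptions the disjoint layout is the maximum of n/b independent Exp(b·rate) variables, so its expected completion time is H_{n/b} / (b·rate), where H_k is the k-th harmonic number. The simulated mean can be checked against that value.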

Load Balancing Performance in Distributed Storage with Regular Balanced Redundancy

Aktas

2019

Contention at the storage nodes is the main cause of long and variable data access times in distributed storage systems. Offered load on the system must be balanced across the storage nodes in order to minimize contention, and load balance in the system should be robust against skews and fluctuations in content popularity. In practice, data objects are replicated across multiple nodes to allow for load balancing. However, redundancy increases the storage requirement and should be used efficiently. We evaluate the load balancing performance of natural storage schemes in which each data object is stored at d different nodes and each node stores the same number of objects. We find that load balance in a system of n nodes improves multiplicatively with d as long as d = o(log(n)), and improves exponentially as soon as d = Θ(log(n)). We show that the load balance in the system improves the same way with d when the service choices are created with XORs of r objects rather than object replicas, which also reduces the storage overhead multiplicatively by r. However, unlike accessing an object replica, access through a recovery set that includes an XOR'ed object copy requires downloading content from r nodes, which increases the load imbalance in the system additively by r.
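A rough sketch of how replication across d nodes can smooth skewed demand is given below. It is a simplification, not the model evaluated in the paper. It assumes n objects on n nodes, with object i stored on nodes i..i+d-1 (mod n), so every node stores d objects. It also assumes that each object's demand is split evenly over its d nodes rather than allocated optimally. The Pareto popularity model and the function names are hypothetical.

```python
import random

def node_loads(popularity, d):
    # Object i is stored on nodes i, i+1, ..., i+d-1 (mod n); its demand is
    # split evenly across those d nodes (a simplifying assumption).
    n = len(popularity)
    loads = [0.0] * n
    for obj, demand in enumerate(popularity):
        for k in range(d):
            loads[(obj + k) % n] += demand / d
    return loads

def imbalance(popularity, d):
    # Ratio of the maximum node load to the average node load (1.0 is ideal).
    loads = node_loads(popularity, d)
    return max(loads) / (sum(loads) / len(loads))

rng = random.Random(1)
popularity = [rng.paretovariate(1.5) for _ in range(64)]  # skewed demand

if __name__ == "__main__":
    for d in (1, 2, 4, 8):
        print("d =", d, "max/avg load =", round(imbalance(popularity, d), 3))
```

With d = 1 the imbalance equals the most popular object's demand divided by the mean demand. With d = n every node carries the same load. The even split is only one possible policy, so this sketch does not reproduce the multiplicative and exponential regimes stated in the abstract.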

Redundancy Scheduling in Systems with Bi-Modal Job Service Time Distributions

2019

Queuing systems with redundant requests have drawn great attention because of their promise to reduce job completion time and its variability. Despite a large body of work on the topic, we are still far from fully understanding the benefits of redundancy in practice. Here we take one step towards practical systems by studying queuing systems with bi-modal job service time distributions. Such distributions have been observed in practice, for example in Google cluster traces. We develop an analogy to a classical urns-and-balls problem, and use it to study the queuing time performance of two classical non-adaptive scheduling policies: random and round-robin. We introduce new performance indicators in the analogous model, and argue that they are good predictors of the queuing time under non-adaptive scheduling policies. We then propose a non-adaptive scheduling policy that is based on combinatorial designs, and show that it has better performance indicators. Simulations confirm that the proposed scheduling policy, as the performance indicators suggest, reduces queuing times compared to random and round-robin scheduling.

Data Replication for Reducing Computing Time in Distributed Systems with Stragglers

Behrouzi-Far, Soljanin

2019

Preprint

Efficient Replication for Fast and Predictable Performance in Distributed Computing

IEEE/ACM Trans. Networking

2021