Using Differential Execution Analysis to Identify Thread Interference

Bouksiaa, Mohamed Said Mosli; Lescouet, Alexis; Voron, Gauthier; Dulong, Rémi; Guermouche, Amina; Brunet, Élisabeth; Thomas, Gaël

doi:10.1109/tpds.2019.2927481

Cited by 4 publications

(2 citation statements)

References 41 publications

“…For example on Densenet, the fastest worker reads samples in 11.9s while the slowest worker reads samples in 142s. This high variation of I/O performance could indicate that the PFS suffers congestion caused by the 512 workers performing IO concurrently [39]. Moreover, workers wait for each other using collective communication during the gradient exchange.…”

Section: F Performance (mentioning)

confidence: 99%

Why Globally Re-shuffle? Revisiting Data Shuffling in Large Scale Deep Learning

Nguyen, Domke, Drozd, et al. 2022

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Self Cite

Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNN). SGD iterates the input data set in each training epoch processing data samples in a random access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node local SSDs. However, due to rapidly growing data set sizes this approach has become increasingly infeasible. Surprisingly, the questions of why and to what extent random access is required have not received a lot of attention in the literature from an empirical standpoint. In this paper, we revisit data shuffling in DL workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrate that in practice validation accuracy of global shuffling can be maintained when carefully tuning the partial distributed exchange. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme.

“…In which, the practical response time is the completed time in the simulation tests, and the theoretical response time is the sum of the arrival time and the required time for doing the read/write request. This metric is referring to Reference [34] and can give the theoretical slowdown of read/write requests caused by waiting in the I/O queue.…”

Section: Long-tail Latency (mentioning)

confidence: 99%

Low I/O Intensity-aware Partial GC Scheduling to Reduce Long-tail Latency in SSDs

Sha, Song, et al. 2021

ACM Trans. Archit. Code Optim.

This article proposes a low I/O intensity-aware scheduling scheme on garbage collection (GC) in SSDs for minimizing the I/O long-tail latency to ensure I/O responsiveness. The basic idea is to assemble partial GC operations by referring to several determinable factors (e.g., I/O characteristics) and dispatch them to be processed together in idle time slots of I/O processing. To this end, it first makes use of Fourier transform to explore the time slots having relative sparse I/O requests for conducting time-consuming GC operations, as the number of affected I/O requests can be limited. After that, it constructs a mathematical model to further figure out the types and quantities of partial GC operations, which are supposed to be dealt with in the explored idle time slots, by taking the factors of I/O intensity, read/write ratio, and the SSD use state into consideration. Through a series of simulation experiments based on several realistic disk traces, we illustrate that the proposed GC scheduling mechanism can noticeably reduce the long-tail latency by between 5.5% and 232.3% at the 99.99th percentile, in contrast to state-of-the-art methods.

EZIOTracer

Naas, Colin, Olivier, et al. 2021

Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems

Tracing is a popular method for evaluating, investigating, and modeling the performance of today's storage systems. Tracing has become crucial with the increase in complexity of modern storage applications and systems, which manipulate an ever-increasing amount of data and are subject to extreme performance requirements. There exist many tracing tools focusing on either the user level or the kernel level; however, we observe the lack of a unified tracer targeting both levels, which prevents a comprehensive understanding of modern applications' storage performance profiles. In this paper, we present EZIOTracer, a unified I/O tracer for both (Linux) kernel and user spaces, targeting data-intensive applications. EZIOTracer is composed of a userland tracer and a kernel-space tracer, complemented with a trace analysis framework able to merge the output of the two tracers and, in particular, to relate user-level events to kernel-level ones and vice versa. On the kernel side, EZIOTracer relies on eBPF to offer safe, low-overhead, low-memory-footprint, and flexible tracing capabilities. Using the FIO benchmark, we demonstrate the ability of EZIOTracer to track down I/O performance issues by relating events recorded at both the kernel and user levels. We show that this can be achieved with a relatively low overhead that ranges from 2% to 26% depending on the I/O intensity.
