2022
DOI: 10.48550/arxiv.2202.01838
Preprint

Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods

Abstract: While SGD, which samples from the data with replacement, is widely studied in theory, a variant called Random Reshuffling (RR) is more common in practice. RR iterates through random permutations of the dataset and has been shown to converge faster than SGD. When the order is instead chosen deterministically, in a variant called incremental gradient descent (IG), the existing convergence bounds show an improvement over SGD but are worse than those for RR. However, these bounds do not differentiate between a good and a bad ordering and …
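For readers unfamiliar with the three sampling schemes named in the abstract, the minimal Python sketch below contrasts them on a generic finite-sum objective. It is an illustration only, not code from the paper; `grad_fn(params, i)` is a hypothetical callback assumed to return the gradient of the i-th example's loss, and the only difference between the three variants is the index sequence they follow.

```python
import numpy as np

def sgd_epoch(params, grad_fn, n, lr, rng):
    # SGD with replacement: each step samples an index uniformly at random,
    # so some examples may be visited several times in an epoch and others not at all.
    for _ in range(n):
        i = rng.integers(n)
        params -= lr * grad_fn(params, i)
    return params

def rr_epoch(params, grad_fn, n, lr, rng):
    # Random Reshuffling (RR): draw a fresh random permutation each epoch
    # and visit every example exactly once.
    for i in rng.permutation(n):
        params -= lr * grad_fn(params, i)
    return params

def ig_epoch(params, grad_fn, n, lr, order):
    # Incremental gradient descent (IG): a fixed, deterministic ordering
    # reused every epoch; convergence then depends on how good that ordering is.
    for i in order:
        params -= lr * grad_fn(params, i)
    return params
```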

Cited by 2 publications (6 citation statements) | References 19 publications
“…HaoChen and Sra [2019], Gürbüzbalaban et al. [2021], and Mishchenko et al. [2020] discuss extensively the conditions needed for RR to be beneficial. Some recent works [Lu et al., 2021a; Mohtashami et al., 2022] suggest constructing better data permutations than RR via a memory-intensive greedy strategy. Concretely, Mohtashami et al. [2022] propose evaluating gradients on all the examples to minimize Equation (2) before starting an epoch, applied to Federated Learning; Lu et al. [2021a] provide an alternative that minimizes Equation (2) using stale gradients from the previous epoch to estimate the gradient on each example. Rajput et al. [2021] introduce an interesting variant of RR that reverses the ordering every other epoch, achieving better rates for quadratics.…”
Section: Related Work
confidence: 99%
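Equation (2) of the citing work is not reproduced on this page. Purely as an illustration of the stale-gradient greedy strategy described in the excerpt, the sketch below assumes the quantity being minimized penalizes how far prefix sums of centered per-example gradients drift from zero, and uses hypothetical names throughout; it is not the authors' implementation.

```python
import numpy as np

def greedy_order_from_stale_grads(stale_grads):
    """Greedily order examples so that running sums of centered (stale)
    gradients stay close to zero. Hypothetical sketch, not the cited code."""
    n = len(stale_grads)
    mean = np.mean(stale_grads, axis=0)
    centered = [g - mean for g in stale_grads]
    remaining = set(range(n))
    running = np.zeros_like(mean)
    order = []
    while remaining:
        # Pick the example whose stale gradient keeps the prefix sum smallest in norm.
        best = min(remaining, key=lambda i: np.linalg.norm(running + centered[i]))
        order.append(best)
        running += centered[best]
        remaining.remove(best)
    return order
```

Storing a stale gradient per example and scanning the remaining set at every step is what makes this kind of greedy strategy memory-intensive (and O(n²) per epoch), which matches how the excerpt characterizes it.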