“…In the non-convex and smooth setting, [Ward et al., 2020] and [Li and Orabona, 2018] prove that the "norm" version of AdaGrad converges to a stationary point at rate O(1/ε²) for stochastic GD and at rate O(1/ε) for batch GD. Many modifications to AdaGrad have been proposed, including RMSprop [Hinton et al., 2012], AdaDelta [Zeiler, 2012], Adam [Kingma and Ba, 2014], AdaFTRL [Orabona and Pál, 2015], SGD-BB [Tan et al., 2016], AcceleGrad [Levy et al., 2018], Yogi [Zaheer et al., 2018a], and Padam [Chen and Gu, 2018], to name a few. More recently, accelerated adaptive gradient methods have also been proven to converge to stationary points [Barakat and Bianchi, 2018, Zaheer et al., 2018b, Zhou et al., 2018, Zou et al., 2018b].…”
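For reference, a minimal sketch of the scalar-stepsize ("norm") variant of AdaGrad referred to above, in notation not taken from the excerpt (η is a base step size, b_k accumulates squared gradient norms, and g_k denotes the (stochastic) gradient at x_k):

\[
b_{k+1}^2 = b_k^2 + \|g_k\|^2,
\qquad
x_{k+1} = x_k - \frac{\eta}{b_{k+1}}\, g_k .
\]

Unlike coordinate-wise AdaGrad, this version rescales the whole gradient by a single adaptive scalar, which is the form analyzed in the stationary-point results cited above.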