“…This inequality confirms that the momentum algorithm can achieve a faster convergence rate in deterministic optimization and, moreover, that this faster rate cannot be attained by standard gradient descent. Motivated by these useful acceleration properties in the deterministic context, momentum terms were subsequently introduced into stochastic optimization algorithms as well (Polyak, 1987; Proakis, 1974; Sharma et al., 1998; Shynk and Roy, 1988; Roy and Shynk, 1990; Tugay and Tanik, 1989; Bellanger, 2001; Wiegerinck et al., 1994; Hu et al., 2009; Xiao, 2010; Lan, 2012; Ghadimi and Lan, 2012; Zhong and Kwok, 2014) and applied, for example, to problems involving the tracking of chirped sinusoidal signals (Ting et al., 2000) or deep learning (Sutskever et al., 2013; Kahou et al., 2013; Szegedy et al., 2015; Zareba et al., 2015). However, the analysis in this paper will show that the advantages of the momentum technique in deterministic optimization do not necessarily carry over to the adaptive online setting, due to the presence of stochastic gradient noise (i.e., the difference between the true gradient vector and its stochastic approximation).…”
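The following is a minimal sketch, not taken from the paper, of the heavy-ball momentum update contrasted with plain gradient descent on a simple deterministic quadratic, and of how gradient noise enters when only a noisy gradient estimate is available. The quadratic, the step size `mu`, the momentum factor `beta`, and the noise level are illustrative assumptions chosen for this example only.

```python
# Sketch: heavy-ball momentum vs. plain gradient descent on f(w) = 0.5 * w^T A w.
# All parameter values below are illustrative assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])     # ill-conditioned quadratic; minimizer is w* = 0
w_star = np.zeros(2)

def grad(w, noise_std=0.0):
    """True gradient A w, optionally perturbed by zero-mean gradient noise."""
    return A @ w + noise_std * rng.standard_normal(2)

def run(mu, beta, noise_std=0.0, iters=200):
    w = np.array([5.0, 5.0])
    w_prev = w.copy()
    for _ in range(iters):
        g = grad(w, noise_std)
        # Heavy-ball update; beta = 0 recovers standard gradient descent.
        w_next = w - mu * g + beta * (w - w_prev)
        w_prev, w = w, w_next
    return np.linalg.norm(w - w_star)

# Deterministic case: the momentum recursion converges faster than plain GD.
print("deterministic: GD %.2e  heavy-ball %.2e"
      % (run(mu=0.09, beta=0.0), run(mu=0.09, beta=0.5)))

# Stochastic case: the same momentum term also accumulates the gradient noise,
# so the deterministic speed-up need not carry over to the noisy setting.
print("stochastic:    GD %.2e  heavy-ball %.2e"
      % (run(mu=0.09, beta=0.0, noise_std=0.5), run(mu=0.09, beta=0.5, noise_std=0.5)))
```

In the noisy runs the final error is dominated by a steady-state noise floor rather than by the convergence rate, which is one informal way to see why the deterministic acceleration argument does not automatically apply in the stochastic setting.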