Markov chain block coordinate descent

Sun, Tao; Sun, Yuejiao; Xu, Yangyang; Yin, Wotao

doi:10.1007/s10589-019-00140-7

Cited by 34 publications

(85 citation statements)

References 26 publications

“…In addition, it would be interesting to see whether the techniques developed herein can be exploited towards understanding model-free algorithms with more sophisticated exploration schemes [64]. Finally, asynchronous Q-learning on a single Markovian trajectory is closely related to coordinate descent with coordinates selected according to a Markov chain; one would naturally ask whether our analysis framework can yield improved convergence guarantees for general Markov-chain-based optimization algorithms [65], [66].…”

Section: Discussion (mentioning)

confidence: 99%
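The link drawn in this statement, coordinate descent whose coordinates are visited along a Markov chain, can be illustrated with a small sketch. The code below is only an illustration under stated assumptions, not the algorithm or analysis of the cited papers: the objective is a strongly convex quadratic, each coordinate is minimized exactly, and the chain is a simple random walk on a ring. The names `markov_chain_cd`, `A`, `b`, and `P` are hypothetical.

```python
import numpy as np

def markov_chain_cd(A, b, P, steps=5000, seed=0):
    """Minimize 0.5 * x^T A x - b^T x, choosing coordinates along a Markov chain.

    A is assumed symmetric positive definite; P is a row-stochastic
    transition matrix over the coordinate indices (both are assumptions
    of this sketch).
    """
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    i = 0  # arbitrary starting state of the chain
    for _ in range(steps):
        # Exact minimization along coordinate i.
        grad_i = A[i] @ x - b[i]
        x[i] -= grad_i / A[i, i]
        # The next coordinate is drawn from row i of the transition matrix,
        # so consecutive coordinates are dependent rather than i.i.d.
        i = rng.choice(n, p=P[i])
    return x

# Hypothetical example: 4 coordinates arranged on a ring.
n = 4
A = np.array([[4.0, 1.0, 0.0, 0.0],
              [1.0, 4.0, 1.0, 0.0],
              [0.0, 1.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 4.0]])
b = np.ones(n)
P = np.zeros((n, n))
for k in range(n):
    P[k, (k - 1) % n] = 0.5
    P[k, (k + 1) % n] = 0.5

x_mc = markov_chain_cd(A, b, P)
x_star = np.linalg.solve(A, b)
print("distance to minimizer:", np.linalg.norm(x_mc - x_star))
```

The only difference from i.i.d. random coordinate descent is the line that draws the next index from `P[i]` instead of from a fixed distribution.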

Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

Wei, Chi, et al. 2022. IEEE Trans. Inform. Theory

Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a $\gamma$-discounted MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, we demonstrate that the $\ell_\infty$-based sample complexity of classical asynchronous Q-learning, namely, the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function, is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, $t_{\mathrm{mix}}$ and $\mu_{\min}$ denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the sample complexity in the synchronous case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the cost taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least $|\mathcal{S}||\mathcal{A}|$ for all scenarios, and by a factor of at least $t_{\mathrm{mix}}|\mathcal{S}||\mathcal{A}|$ for any sufficiently small accuracy level $\varepsilon$. Further, we demonstrate that the scaling on the effective horizon $\frac{1}{1-\gamma}$ can be improved by means of variance reduction.
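For readers unfamiliar with the setting, the following is a minimal sketch of asynchronous Q-learning on a single Markovian trajectory with a constant learning rate. It is written under assumptions that are not taken from the abstract: a tabular toy MDP with randomly generated transitions and rewards, a uniformly random behavior policy, and hypothetical names (`async_q_learning`, `q_star`, `P`, `R`, `eta`). It does not reproduce the paper's analysis or its variance-reduced variant.

```python
import numpy as np

def async_q_learning(P, R, gamma, eta, T, seed=0):
    """Asynchronous Q-learning along one trajectory.

    P has shape (S, A, S) with P[s, a] a distribution over next states;
    R has shape (S, A). The behavior policy is uniform over actions
    (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    s = 0
    for _ in range(T):
        a = rng.integers(A)
        s_next = rng.choice(S, p=P[s, a])
        target = R[s, a] + gamma * Q[s_next].max()
        # Only the visited (s, a) entry is updated: this is what makes
        # the scheme asynchronous, like a coordinate-wise update.
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target
        s = s_next
    return Q

def q_star(P, R, gamma, iters=2000):
    """Reference solution by synchronous value iteration on the Q-function."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * (P @ Q.max(axis=1))
    return Q

# Hypothetical toy MDP with 3 states and 2 actions.
rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
R = rng.uniform(size=(S, A))

Q_hat = async_q_learning(P, R, gamma, eta=0.05, T=20000)
print("entrywise error:", np.max(np.abs(Q_hat - q_star(P, R, gamma))))
```

With stochastic transitions and a constant learning rate, the entrywise error settles at a level that depends on `eta`, which is consistent with the abstract's emphasis on choosing a proper constant learning rate for a target accuracy.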

“…assumption on data samples. We obtain global convergence to stationary points of rate $O((\log n)^{1+\varepsilon}/n^{1/2})$, matching the optimal convergence rates for SGD based methods [SSY18,DD19]. Interestingly, our analysis shows that SBMM (and hence SMM) is more adapted to solve empirical loss minimization than expected loss minimization, in the sense that the aforementioned rate of convergence holds for the empirical loss functions almost surely and in expectation for the expected loss function; an almost sure convergence for the empirical loss function is obtained at a slower rate of $O((\log n)^{1+\varepsilon}/n^{1/4})$.…”

Section: Introduction (mentioning)

confidence: 52%

“…Assumption (A4) states that the sequence of weights $w_n \in (0, 1]$ we use to recursively define the empirical loss (1) and surrogate loss (7) does not decay too fast so that $\sum_{n=1}^{\infty} w_n = \infty$ but decay fast enough so that $\sum_{n=1}^{\infty} w_n^2 < \infty$. This is analogous to requirements for stepsizes in stochastic gradient descent algorithms, where the stepsizes are usually required to be non-summable but square-summable (see, e.g., [SSY18]). Note that our general results do not require the stronger assumption $\sum_{n=1}^{\infty} w_n^2 \sqrt{n} < \infty$, which is standard in the literature [MBPS10, Mai13b, MMTV17, LNB20, LSN20].…”

Section: (A6) (mentioning)

confidence: 99%
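As a worked illustration of the "non-summable but square-summable" requirement quoted above (the polynomial family of weights is an assumption chosen for this example, not something taken from the cited paper), consider weights $w_n = n^{-\beta}$ and apply the standard p-series test:

```latex
% Assumed example family: w_n = n^{-\beta} with \beta > 0, so w_n \in (0, 1].
\sum_{n=1}^{\infty} w_n   = \sum_{n=1}^{\infty} n^{-\beta}  = \infty
    \iff \beta \le 1,
\qquad
\sum_{n=1}^{\infty} w_n^2 = \sum_{n=1}^{\infty} n^{-2\beta} < \infty
    \iff \beta > \tfrac{1}{2}.
% Both conditions hold exactly when 1/2 < \beta \le 1.
```

So, within this family, the classical choice $w_n = 1/n$ satisfies both conditions, while $w_n = n^{-1/2}$ decays too slowly to be square-summable.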

“…Optimization algorithms with Markovian data samples were studied in [JRJ07,JRJ10] in the context of distributed optimization in networks. More recently, it was shown in [SSY18] that arbitrarily initialized SGD almost surely converges to critical points of unconstrained nonconvex objectives at rate $O((\log n)^2/\sqrt{n})$, even when the data samples have a Markovian dependence.…”

Section: Introduction (mentioning)

confidence: 99%

“…Stochastic Gradient Descent (SGD) is another popular method for various optimization problems. In [SSY18], a convergence of SGD under Markovian data assumption is obtained. For the convex case, [SSY18, Thm.…”

Section: Introduction (mentioning)

confidence: 99%

Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates

Lyu, 2022. Preprint

Stochastic majorization-minimization (SMM) is an online extension of the classical principle of majorization-minimization, which consists of sampling i.i.d. data points from a fixed data distribution and minimizing a recursively defined majorizing surrogate of an objective function. In this paper, we introduce stochastic block majorization-minimization, where the surrogates can now be only block multi-convex and a single block is optimized at a time within a diminishing radius. Relaxing the standard strong convexity requirements for surrogates in SMM, our framework gives wider applicability including online CANDECOMP/PARAFAC (CP) dictionary learning and yields greater computational efficiency especially when the problem dimension is large. We provide an extensive convergence analysis on the proposed algorithm, which we derive under possibly dependent data streams, relaxing the standard i.i.d. assumption on data samples. We show that the proposed algorithm converges almost surely to the set of stationary points of a nonconvex objective under constraints at a rate $O((\log n)^{1+\varepsilon}/n^{1/2})$ for the empirical loss function and $O((\log n)^{1+\varepsilon}/n^{1/4})$ for the expected loss function, where $n$ denotes the number of data samples processed. Under some additional assumption, the latter convergence rate can be improved to $O((\log n)^{1+\varepsilon}/n^{1/2})$. Our results provide first convergence rate bounds for various online matrix and tensor decomposition algorithms under a general Markovian data setting.

Private Weighted Random Walk Stochastic Gradient Descent

Ayache, Rouayheb, 2021. IEEE J. Sel. Areas Inf. Theory

We consider a decentralized learning setting in which data is distributed over nodes in a graph. The goal is to learn a global model on the distributed data without involving any central entity that needs to be trusted. While gossip-based stochastic gradient descent (SGD) can be used to achieve this learning objective, it incurs high communication and computation costs, since it has to wait for all the local models at all the nodes to converge. To speed up the convergence, we propose instead to study random walk based SGD in which a global model is updated based on a random walk on the graph. We propose two algorithms based on two types of random walks that achieve, in a decentralized way, uniform sampling and importance sampling of the data. We provide a non-asymptotic analysis on the rate of convergence, taking into account the constants related to the data and the graph. Our numerical results show that the weighted random walk based algorithm has a better performance for high-variance data. Moreover, we propose a privacy-preserving random walk algorithm that achieves local differential privacy based on a Gamma noise mechanism that we propose. We also give numerical results on the convergence of this algorithm and show that it outperforms additive Laplace-based privacy mechanisms.
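A rough sketch of the random-walk SGD idea described in this abstract follows. It is an illustration under assumptions, not the authors' algorithms: the walk uses the standard Metropolis-Hastings rule so that its stationary distribution is uniform over nodes, each node holds a single noise-free least-squares sample, and the importance-sampling and privacy mechanisms are omitted. The names `mh_uniform_transition` and `random_walk_sgd` and the star-graph data are hypothetical.

```python
import numpy as np

def mh_uniform_transition(adj):
    """Metropolis-Hastings transition matrix with uniform stationary distribution.

    adj is a symmetric 0/1 adjacency matrix of a connected graph.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                # Propose a neighbor uniformly, accept with min(1, deg_i / deg_j).
                P[i, j] = min(1.0 / deg[i], 1.0 / deg[j])
        # Rejected proposals keep the walk at node i.
        P[i, i] = max(0.0, 1.0 - P[i].sum())
    return P

def random_walk_sgd(X, y, P, lr=0.05, steps=5000, seed=0):
    """Least-squares SGD where node i holds (X[i], y[i]) and the model follows the walk."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    i = 0
    for _ in range(steps):
        # Gradient of 0.5 * (x_i^T w - y_i)^2, computed locally at node i.
        w -= lr * (X[i] @ w - y[i]) * X[i]
        # Pass the model to the next node of the walk.
        i = rng.choice(n, p=P[i])
    return w

# Hypothetical star graph: node 0 is connected to nodes 1..4.
adj = np.zeros((5, 5), dtype=int)
adj[0, 1:] = 1
adj[1:, 0] = 1
P = mh_uniform_transition(adj)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [2.0, 1.0]])
w_true = np.array([1.5, -0.5])
y = X @ w_true            # noise-free labels (an assumption of this sketch)
w_hat = random_walk_sgd(X, y, P)
print("estimation error:", np.linalg.norm(w_hat - w_true))
```

On the star graph, a plain random walk would visit the hub far more often than the leaves; the Metropolis-Hastings correction makes the transition matrix symmetric, which is what yields uniform sampling of the nodes in the long run.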
