“…Exploration has been widely studied in the tabular setting (Azar et al., 2017; Zanette and Brunskill, 2019; Efroni et al., 2019; Jin et al., 2018; Dann et al., 2019; Zhang et al., 2020; Russo, 2019), but obtaining formal guarantees for exploration with function approximation remains challenging even in the linear case, due to recent lower bounds (Du et al., 2019; Weisz et al., 2020; Zanette, 2020; Wang et al., 2020a). When the action-value function is only approximately linear, several ideas from tabular exploration and linear bandits (Lattimore and Szepesvári, 2020) have been combined to obtain provably efficient algorithms in low-rank MDPs (Yang and Wang, 2020; Zanette et al., 2020a; Jin et al., 2020) and their extensions (Wang et al., 2019, 2020b).…”