This work studies the question of Representation Learning in RL: how can we learn a compact low-dimensional representation such that, on top of it, we can perform RL procedures such as exploration and exploitation in a sample-efficient manner. We focus on low-rank Markov Decision Processes (MDPs), where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al., 2020b), the state-of-the-art algorithm for learning representations in low-rank MDPs, we propose REP-UCB (Upper Confidence Bound driven REPresentation learning for RL), which significantly improves upon FLAMBE's sample complexity of $\widetilde{O}\big(A^{9} d^{7} / (\epsilon^{10} (1-\gamma)^{22})\big)$, where $\epsilon$ is the target accuracy, $d$ is the rank of the transition matrix (i.e., the dimension of the ground-truth representation), $A$ is the number of actions, and $\gamma$ is the discount factor. Notably, REP-UCB is simpler than FLAMBE: it directly balances the interplay between representation learning, exploration, and exploitation, whereas FLAMBE is an explore-then-commit style approach that has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as that policy is covered by the offline data distribution.
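
As an illustrative aside (not part of the original abstract), the low-rank structure referred to above is usually formalized as a factorization of the transition kernel through a $d$-dimensional feature map; the symbols $\phi^\star$ and $\mu^\star$ below denote the unknown ground-truth representation and its companion map, a standard convention in this line of work:
\[
  P(s' \mid s, a) \;=\; \big\langle \phi^\star(s,a),\, \mu^\star(s') \big\rangle,
  \qquad \phi^\star(s,a),\ \mu^\star(s') \in \mathbb{R}^{d},
\]
so that learning the representation amounts to estimating $\phi^\star$ from data (e.g., from a candidate feature class) rather than assuming it is given, as in linear MDPs.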