2021
DOI: 10.48550/arxiv.2108.05439
Preprint

Gap-Dependent Unsupervised Exploration for Reinforcement Learning

Abstract: For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, is then presented with a reward function, and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenario, in which no structural information about the reward or transition dynamics is utilized. Therefore the best sample upper bound scales as Õ(1/ε²), where ε > 0 is the target accuracy of the obtained policy, and ca…

Cited by 2 publications (1 citation statement)
References 20 publications
“…The reward-free RL setting seeks to explore an MDP without access to a reward function, in order to then determine a near-optimal policy for an arbitrary reward function. This has been studied in the tabular setting (Jin et al., 2020a; Ménard et al., 2020; Zhang et al., 2020a; Wu et al., 2021), where the optimal scaling is known to be Θ(|S|²|A|/ε²) (Jin et al., 2020a), as well as in the function approximation setting (Zanette et al., 2020c; Wang et al., 2020; Zhang et al., 2021a). In the linear MDP setting, Wang et al. (2020) show a sample complexity of O(d³H⁶/ε²), and Zanette et al. (2020c) show a complexity of…”
Section: Related Work
Mentioning confidence: 99%
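As a reading aid, the two scalings quoted in the statement above can be written out as displays. The shorthands N_tab(ε) and N_lin(ε) are introduced here for illustration only (they are not notation from the cited papers) and denote the number of samples needed to return an ε-optimal policy for any reward revealed after exploration; horizon and logarithmic factors are kept only where the quoted text keeps them:

\[
N_{\mathrm{tab}}(\varepsilon) \;=\; \Theta\!\left(\frac{|S|^{2}|A|}{\varepsilon^{2}}\right)
\quad\text{(tabular; Jin et al., 2020a)},
\qquad
N_{\mathrm{lin}}(\varepsilon) \;=\; O\!\left(\frac{d^{3}H^{6}}{\varepsilon^{2}}\right)
\quad\text{(linear MDP; Wang et al., 2020)}.
\]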