2020
DOI: 10.48550/arxiv.2006.06294
Preprint

Adaptive Reward-Free Exploration

Abstract: Reward-free exploration is a reinforcement learning setting recently studied by Jin et al. [17], who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994 [11], originally proposed for a different objective tha…
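A minimal sketch of the idea described in the abstract, assuming a finite tabular MDP: the agent maintains an upper bound W_h(s, a) on the MDP estimation error, acts greedily with respect to it, and stops once the bound at the initial state is small. The environment interface (env.reset, env.step), the form of the bonus term, and the stopping threshold are illustrative assumptions, not the paper's exact quantities.

import numpy as np

def bonus(count, H, delta):
    # Placeholder confidence width; the paper derives the exact form from
    # concentration bounds on the empirical transition kernel.
    return H * np.sqrt(np.log(1.0 / delta) / np.maximum(count, 1))

def reward_free_ucrl(env, S, A, H, eps, delta, max_episodes=100_000):
    # counts[s, a, s'] = number of observed transitions (s, a) -> s'
    counts = np.zeros((S, A, S))
    for _ in range(max_episodes):
        n_sa = counts.sum(axis=2)
        p_hat = counts / np.maximum(n_sa[..., None], 1)   # empirical transitions

        # Backward recursion: W[h, s, a] upper-bounds the error made when
        # evaluating any reward function from (s, a) at step h in the model.
        W = np.zeros((H + 1, S, A))
        for h in range(H - 1, -1, -1):
            next_err = W[h + 1].max(axis=1)               # max_a' W_{h+1}(s', a')
            W[h] = np.minimum(H, bonus(n_sa, H, delta) + p_hat @ next_err)

        s = env.reset()                                   # hypothetical reward-free env
        if W[0, s].max() <= eps / 2:                      # stopping rule (threshold assumed)
            break
        for h in range(H):                                # act greedily w.r.t. the error bound
            a = int(W[h, s].argmax())
            s_next = env.step(a)                          # returns only the next state
            counts[s, a, s_next] += 1
            s = s_next
    return counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)

After stopping, the returned empirical model can be handed to a planner together with any reward function supplied later, which is what makes the exploration reward-free.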

Cited by 4 publications (12 citation statements) | References 11 publications
“…Note this framework captures many practical MORL applications including the previous safe autonomous driving applications. It also extends the recently proposed reward-free exploration in reinforcement learning [Jin et al, 2020, Zhang et al, 2020a, Kaufmann et al, 2020, Wang et al, 2020] to MORL cases.…”
Section: Introduction (mentioning)
confidence: 70%
“…In this section we study MORL in the preference-free exploration (PFE) setting. We note that PFE is the counterpart of reward-free exploration (RFE) [Jin et al, 2020, Kaufmann et al, 2020, Zhang et al, 2020a, Wang et al, 2020] in MORL. PFE consists of two phases.…”
Section: Preference-free Exploration (mentioning)
confidence: 99%
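The two-phase structure mentioned in the quote above (explore without observing any reward or preference, then plan once one is revealed) can be sketched as follows; the function names and the uniform-exploration stand-in are assumptions for illustration, not the cited papers' algorithms.

import numpy as np

def exploration_phase(env, S, A, H, num_episodes):
    # Phase 1: collect transitions without observing any reward signal.
    counts = np.zeros((S, A, S))
    for _ in range(num_episodes):
        s = env.reset()
        for h in range(H):
            a = np.random.randint(A)          # stand-in for any exploration strategy
            s_next = env.step(a)              # reward-free: only the next state is seen
            counts[s, a, s_next] += 1
            s = s_next
    return counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)

def planning_phase(p_hat, reward, H):
    # Phase 2: the reward (or scalarized preference) is revealed only now;
    # run backward value iteration in the learned model p_hat.
    S, A, _ = p_hat.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = reward + p_hat @ V                # reward has shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy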
“…It requires a high-probability guarantee for learning an optimal policy for any reward function, which is strictly stronger than the standard learning task, where one only needs to learn an optimal policy for a fixed reward. Later, Kaufmann et al. (2020); Menard et al. (2020) establish the Õ(H^3 S^2 A/ε^2) complexity and Zhang et al. (2020d) further tightens the dependence to Õ(H^2 S^2 A/ε^2). Recently, Zhang et al. (2020c) proposes the task-agnostic setting where one needs to use exploration data to simultaneously learn K tasks and provides an upper bound with complexity Õ(H^5 SA log(K)/ε^2).…”
Section: Discussion (mentioning)
confidence: 91%
“…For the single-agent scenario, Jin et al. (2020a) formalizes reward-free RL for the tabular setting and provides a theoretical analysis of the proposed algorithm with an O(poly(H, |S|, |A|)/ε^2) sample complexity for achieving an ε-suboptimal policy. The sample complexity for the tabular setting is further improved in several recent works (Kaufmann et al., 2020; Ménard et al., 2020; Zhang et al., 2020). Recently, Zanette et al. (2020b); Wang et al. (2020a) study reward-free RL from the perspective of linear function approximation.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recently, many works focus on designing provably sample-efficient reward-free RL algorithms. For the single-agent tabular case, Jin et al. (2020a); Kaufmann et al. (2020); Ménard et al. (2020); Zhang et al. (2020) achieve O(poly(H, |S|, |A|)/ε^2) sample complexity for obtaining an ε-suboptimal policy, where |S| and |A| are the sizes of the state and action spaces, respectively. In view of the large action and state spaces, Zanette et al. (2020b); Wang et al. (2020a) theoretically analyze reward-free RL by applying linear function approximation to the single-agent Markov decision process (MDP), achieving O(poly(H, d)/ε^2) sample complexity with d denoting the dimension of the feature space.…”
Section: Introduction (mentioning)
confidence: 99%