2020
DOI: 10.1007/978-3-030-59152-6_6

Faithful and Effective Reward Schemes for Model-Free Reinforcement Learning of Omega-Regular Objectives

Abstract: Omega-regular properties, specified using linear-time temporal logic or various forms of omega-automata, find increasing use in specifying the objectives of reinforcement learning (RL). The key problem that arises is that of faithful and effective translation of the objective into a scalar reward for model-free RL. A recent approach exploits Büchi automata with restricted nondeterminism to reduce the search for an optimal policy for an ω-regular property to that for a simple reachability objective. A possible dr…

Cited by 19 publications (17 citation statements)
References 32 publications
“…A common pattern in these previous works (Hahn et al. 2020) is that each work constructs a product MDP with rewards (i.e., an MDP with a reward function on that MDP) from an LTL formula and an environment MDP. Moreover, these works permit the use of any standard reinforcement-learning algorithm, such as Q-learning or SARSA(λ), to solve the constructed product MDP with the specified reward function to obtain the product MDP's optimal policy.…”
Section: F1 Details of Methodology (mentioning)
confidence: 99%
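As a minimal illustration of the common pattern described in this statement (not code from any of the cited works), the following Python sketch runs tabular Q-learning on the product of an environment MDP and an automaton derived from an LTL formula; env_step, aut_step, and reward_fn are hypothetical stand-ins for the environment, the omega-automaton, and whichever reward scheme is plugged in.

import random
from collections import defaultdict

def q_learning_on_product(env_step, aut_step, reward_fn, actions,
                          episodes=1000, horizon=200,
                          alpha=0.1, gamma=0.99, eps=0.1,
                          init_env_state=0, init_aut_state=0):
    # Q-values are indexed by product states (env_state, aut_state) and actions.
    Q = defaultdict(float)
    for _ in range(episodes):
        s, q = init_env_state, init_aut_state
        for _ in range(horizon):
            # Epsilon-greedy action selection on the product MDP.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, q), a_])
            s2, label = env_step(s, a)          # environment transition and its label
            q2, accepting = aut_step(q, label)  # automaton reads the label
            r = reward_fn(accepting)            # scalar reward from the chosen scheme
            best_next = max(Q[(s2, q2), a_] for a_ in actions)
            Q[(s, q), a] += alpha * (r + gamma * best_next - Q[(s, q), a])
            s, q = s2, q2
    return Q

Any other standard algorithm, such as SARSA(λ), could replace the update rule above without changing the product construction.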
“…We then characterize each reinforcement-learning algorithm as a "reward-scheme" and "learning-algorithm" pair. We consider a total of five reward-schemes: Reward-on-acc, Multi-discount, Zeta-reach, Zeta-acc (Hahn et al. 2020), and Zeta-discount (Hahn et al. 2020). We consider a total of three learning-algorithms: Q-learning (Watkins and Dayan 1992), Double Q-learning (Hasselt 2010), and SARSA(λ) (Sutton 1988).…”
Section: F1 Details of Methodology (mentioning)
confidence: 99%
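Read as a configuration grid, the pairing described above yields fifteen combinations; the sketch below simply enumerates them, with names mirroring the cited evaluation's terminology (the helper itself is hypothetical).

from itertools import product

REWARD_SCHEMES = ["Reward-on-acc", "Multi-discount",
                  "Zeta-reach", "Zeta-acc", "Zeta-discount"]
LEARNING_ALGORITHMS = ["Q-learning", "Double Q-learning", "SARSA(lambda)"]

def all_configurations():
    # Every reward-scheme / learning-algorithm pair: 5 x 3 = 15 configurations.
    return list(product(REWARD_SCHEMES, LEARNING_ALGORITHMS))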
“…- Dense limit-reachability. The dense limit-reachability reward scheme [12] connects the approaches of [11] and [3]. This reward scheme is identical to [11] except that it gives a +1 reward every time an accepting transition is seen, instead of only when the transition to the sink succeeds.…”
Section: Overview of Mungojerrie (mentioning)
confidence: 99%
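The contrast drawn in this statement can be sketched as two reward functions, under the assumption that the learner observes whether the current product transition is accepting and whether it has just reached the accepting sink (both function names are hypothetical, not Mungojerrie's API).

def sparse_limit_reachability_reward(reached_sink):
    # Scheme of [11]: +1 only when the transition to the sink succeeds.
    return 1.0 if reached_sink else 0.0

def dense_limit_reachability_reward(accepting_transition):
    # Dense variant [12]: +1 every time an accepting transition is seen.
    return 1.0 if accepting_transition else 0.0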
“…The tool also has methods for performing probabilistic model checking (including end-component decomposition, stochastic shortest-path, and discounted-reward optimization) of ω-regular objectives on the same data structures used for learning. Mungojerrie also provides reference implementations of several reward schemes [11,12,14,19,23] proposed by the formal methods community. Mungojerrie is packaged with over 100 benchmarks and outputs GraphViz [8] for easy visualization of small models and automata.…”
Section: Introduction (mentioning)
confidence: 99%
“…We obtain a Markov decision process (MDP) whose transitions are labeled with multiple separate Büchi conditions. Then, generalizing a method described in previous work [13], we transform different combinations of Büchi conditions into different rewards in a reduction to a weighted reachability problem that depends on the prioritisation. We end up with an MDP equipped with a standard scalar reward, to which general RL algorithms can be applied.…”
Section: Introduction (mentioning)
confidence: 99%
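As a rough, assumption-laden sketch of the idea in this statement (not the authors' construction), multiple Büchi conditions can be folded into one scalar reward by weighting each condition according to the chosen prioritisation, so that a standard RL algorithm only ever sees an ordinary reward signal.

def combined_reward(accepting_flags, priority_weights):
    # accepting_flags[i] says whether Büchi condition i was just satisfied on
    # this transition; priority_weights encodes the assumed prioritisation.
    return sum(w for flag, w in zip(accepting_flags, priority_weights) if flag)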