2020
DOI: 10.1007/978-3-030-59152-6_6

Faithful and Effective Reward Schemes for Model-Free Reinforcement Learning of Omega-Regular Objectives

Abstract: Omega-regular properties, specified using linear-time temporal logic or various forms of omega-automata, find increasing use in specifying the objectives of reinforcement learning (RL). The key problem that arises is that of faithful and effective translation of the objective into a scalar reward for model-free RL. A recent approach exploits Büchi automata with restricted nondeterminism to reduce the search for an optimal policy for an ω-regular property to that for a simple reachability objective. A possible dr…

Cited by 19 publications (17 citation statements)
References 32 publications
“…A common pattern in these previous works (Hahn et al. 2020) is that each work constructs a product MDP with rewards (i.e., an MDP with a reward function on that MDP) from an LTL formula and an environment MDP. Moreover, these works permit the use of any standard reinforcement-learning algorithm, such as Q-learning or SARSA(λ), to solve the constructed product MDP with the specified reward function to obtain the product MDP's optimal policy.…”
Section: F1 Details of Methodology (mentioning)
confidence: 99%
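As a minimal illustration of the common pattern described in this statement (not code from any of the cited works), the following Python sketch runs tabular Q-learning on the product of an environment MDP and an automaton derived from an LTL formula; env_step, aut_step, and reward_fn are hypothetical stand-ins for the environment, the omega-automaton, and whichever reward scheme is plugged in.

import random
from collections import defaultdict

def q_learning_on_product(env_step, aut_step, reward_fn, actions,
                          episodes=1000, horizon=200,
                          alpha=0.1, gamma=0.99, eps=0.1,
                          init_env_state=0, init_aut_state=0):
    # Q-values are indexed by product states (env_state, aut_state) and actions.
    Q = defaultdict(float)
    for _ in range(episodes):
        s, q = init_env_state, init_aut_state
        for _ in range(horizon):
            # Epsilon-greedy action selection on the product MDP.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, q), a_])
            s2, label = env_step(s, a)          # environment transition and its label
            q2, accepting = aut_step(q, label)  # automaton reads the label
            r = reward_fn(accepting)            # scalar reward from the chosen scheme
            best_next = max(Q[(s2, q2), a_] for a_ in actions)
            Q[(s, q), a] += alpha * (r + gamma * best_next - Q[(s, q), a])
            s, q = s2, q2
    return Q

Any other standard algorithm, such as SARSA(λ), could replace the update rule above without changing the product construction.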
“…We then characterize each reinforcement-learning algorithm as a "reward-scheme" and "learning-algorithm" pair. We consider a total of five reward-schemes: Reward-on-acc, Multi-discount, Zeta-reach, Zeta-acc (Hahn et al. 2020), and Zeta-discount (Hahn et al. 2020). We consider a total of three learning-algorithms: Q-learning (Watkins and Dayan 1992), Double Q-learning (Hasselt 2010), and SARSA(λ) (Sutton 1988).…”
Section: F1 Details of Methodology (mentioning)
confidence: 99%
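Read as a configuration grid, the pairing described above yields fifteen combinations; the sketch below simply enumerates them, with names mirroring the cited evaluation's terminology (the helper itself is hypothetical).

from itertools import product

REWARD_SCHEMES = ["Reward-on-acc", "Multi-discount",
                  "Zeta-reach", "Zeta-acc", "Zeta-discount"]
LEARNING_ALGORITHMS = ["Q-learning", "Double Q-learning", "SARSA(lambda)"]

def all_configurations():
    # Every reward-scheme / learning-algorithm pair: 5 x 3 = 15 configurations.
    return list(product(REWARD_SCHEMES, LEARNING_ALGORITHMS))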
“…- Dense limit-reachability. The dense limit-reachability reward scheme [12] connects the approaches of [11] and [3]. This reward scheme is identical to [11] except that it gives a +1 reward every time an accepting transition is seen, instead of only when the transition to the sink succeeds.…”
Section: Overview of Mungojerrie (mentioning)
confidence: 99%
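The contrast drawn in this statement can be sketched as two reward functions, under the assumption that the learner observes whether the current product transition is accepting and whether it has just reached the accepting sink (both function names are hypothetical, not Mungojerrie's API).

def sparse_limit_reachability_reward(reached_sink):
    # Scheme of [11]: +1 only when the transition to the sink succeeds.
    return 1.0 if reached_sink else 0.0

def dense_limit_reachability_reward(accepting_transition):
    # Dense variant [12]: +1 every time an accepting transition is seen.
    return 1.0 if accepting_transition else 0.0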
“…The tool also has methods for performing probabilistic model checking (including end-component decomposition, stochastic shortest-path, and discounted-reward optimization) of ω-regular objectives on the same data structures used for learning. Mungojerrie also provides reference implementations of several reward schemes [11,12,14,19,23] proposed by the formal methods community. Mungojerrie is packaged with over 100 benchmarks and outputs GraphViz [8] for easy visualization of small models and automata.…”
Section: Introduction (mentioning)
confidence: 99%
“…We obtain a Markov decision process (MDP) whose transitions are labeled with multiple separate Büchi conditions. Then, generalizing a method described in previous work [13], we transform different combinations of Büchi conditions into different rewards in a reduction to a weighted reachability problem that depends on the prioritisation. We end up with an MDP equipped with a standard scalar reward, to which general RL algorithms can be applied.…”
Section: Introduction (mentioning)
confidence: 99%
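As a rough, assumption-laden sketch of the idea in this statement (not the authors' construction), multiple Büchi conditions can be folded into one scalar reward by weighting each condition according to the chosen prioritisation, so that a standard RL algorithm only ever sees an ordinary reward signal.

def combined_reward(accepting_flags, priority_weights):
    # accepting_flags[i] says whether Büchi condition i was just satisfied on
    # this transition; priority_weights encodes the assumed prioritisation.
    return sum(w for flag, w in zip(accepting_flags, priority_weights) if flag)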