2019
DOI: 10.48550/arxiv.1902.00778
Preprint

Certified Reinforcement Learning with Logic Guidance

Abstract: This paper proposes the first model-free Reinforcement Learning (RL) framework to synthesise policies for unknown and continuous-state Markov Decision Processes (MDPs), such that a given linear temporal property is satisfied. We convert the given property into a Limit Deterministic Büchi Automaton (LDBA), namely a finite-state machine expressing the property. Exploiting the structure of the LDBA, we shape a synchronous reward function on-the-fly, so that an RL algorithm can synthesise a policy resulting in tr…

Cited by 17 publications (29 citation statements). References 94 publications (174 reference statements).
“…Such a property is the innovation of the EP-MDP, which encourages all accepting sets to be visited in each round. Since the action is selected randomly, there is a non-zero probability that x_P violates the LTL task (lines 14-16).…”
Section: Embedded Product MDP
confidence: 99%
“…The upper bound $m = \max M$, $M < P_{tr}^{\pi} P R_{rec}^{\pi}$, and $m \ge 0$, where $P$ is a block matrix whose nonzero entries are derived similarly to $p$ in (14). The utility $U_{tr}^{\pi}(x_R)$ can be lower bounded from (14) and (15) as…”
Section: Base Reward
confidence: 99%
“…Research regarding goal-conditioned policies was introduced both in the context of RL [11] and in the context of model-based non-stationary control [12]. Ordered-goal tasks have also been recently proposed in RL for training agents with symbolic and temporal rules [13-20]. Policy-concatenation (stitching) has also been proposed as a means of skill-learning in the Options framework [21-23] and in Hierarchical Control [24,12]; other schemes for concatenating policies include policy sketches (using modular policies and task-specific reward functions) [25] and compositional plan vectors [26].…”
Section: Introduction
confidence: 99%