2018
DOI: 10.48550/arxiv.1809.09318
Preprint

Floyd-Warshall Reinforcement Learning: Learning from Past Experiences to Reach New Goals

Abstract: Consider multi-goal tasks that involve static environments and dynamic goals. Examples of such tasks, such as goal-directed navigation and pick-and-place in robotics, abound. Two types of Reinforcement Learning (RL) algorithms are used for such tasks: model-free or model-based. Each of these approaches has limitations. Model-free RL struggles to transfer learned information when the goal location changes, but achieves high asymptotic accuracy in single goal tasks. Model-based RL can transfer learned information…

Cited by 3 publications (7 citation statements). References 19 publications.
“…However, as was already observed in [11], the FW relaxation requires that the values always over-estimate the optimal costs, and any under-estimation error, due to noise or function approximation, gets propagated through the algorithm without any way of recovery, leading to instability. Indeed, both [11, 6] showed results only for table-lookup value functions, and in our experiments we have found that replacing the STDP update with a FW relaxation (reported in the supplementary) leads to instability when used with function approximation. On the other hand, the complexity of STDP is O(N^3 log N), but the explicit dependence on k in the value function allows for a stable update when using function approximation.…”
Section: Algorithm
confidence: 72%
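The Floyd-Warshall (FW) relaxation this statement refers to is easiest to see in the tabular, goal-conditioned setting. The sketch below is an illustrative reconstruction, not either paper's code: the table `V`, the transition-recording helper, and the unit-cost chain example are assumptions for the example. It also makes the quoted stability argument concrete: each relaxation step only lowers an entry toward the true cost, so the table remains valid only if it over-estimates to begin with; an entry that is too low is propagated into other entries and never pushed back up.

```python
import numpy as np

# Minimal tabular sketch of a Floyd-Warshall-style relaxation for
# goal-conditioned costs (illustrative; names are assumptions, not the
# paper's implementation). V[s, g] upper-bounds the cost of reaching
# goal g from state s and is initialised optimistically high.
N = 5                                  # number of discrete states
INF = 1e9
V = np.full((N, N), INF)
np.fill_diagonal(V, 0.0)

def record_transition(s, s_next, cost):
    """Tighten the bound for a directly observed transition."""
    V[s, s_next] = min(V[s, s_next], cost)

def floyd_warshall_relaxation():
    """Relax every (s, g) pair through every intermediate state w.
    Each update can only decrease V, so the table stays an over-estimate
    as long as it started as one; an under-estimate (e.g. from
    approximation noise) is propagated, never repaired."""
    for w in range(N):
        for s in range(N):
            for g in range(N):
                V[s, g] = min(V[s, g], V[s, w] + V[w, g])

# Example: a 5-state chain with unit step costs.
for s in range(N - 1):
    record_transition(s, s + 1, 1.0)
floyd_warshall_relaxation()
print(V[0, 4])   # 4.0: cost of the composed path 0 -> 1 -> ... -> 4
```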
“…, s_N, and denote by c(s, s') ≥ 0 the weight of edge s → s'. To simplify notation, we replace unconnected edges by edges with weight ∞, creating a complete graph. The APSP problem seeks the shortest paths (i.e., a path with minimum sum of costs) from any start node s to any goal node g in the graph.…”
Section: A Dynamic Programming Principle For Sub-goal Trees
confidence: 99%
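The all-pairs shortest path (APSP) problem as set up in this statement (a complete graph with edge weights c(s, s') and missing edges replaced by weight ∞) is what the classic Floyd-Warshall recursion d(s, g) ← min(d(s, g), d(s, k) + d(k, g)) solves. A minimal sketch, with the cost matrix `c` as an assumed input:

```python
import numpy as np

def all_pairs_shortest_paths(c):
    """Classic Floyd-Warshall on a cost matrix c[i, j] >= 0, with
    unconnected edges encoded as np.inf (the completion used in the
    quoted statement). Returns d with d[s, g] equal to the
    shortest-path cost from s to g."""
    d = c.astype(float).copy()
    np.fill_diagonal(d, 0.0)
    n = d.shape[0]
    for k in range(n):            # allow paths through intermediate node k
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    return d

# Toy example: 3 nodes, missing edges replaced by infinity.
c = np.array([[0.0,    1.0,    np.inf],
              [np.inf, 0.0,    2.0],
              [np.inf, np.inf, 0.0]])
print(all_pairs_shortest_paths(c)[0, 2])   # 3.0, via the middle node
```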
“…In goal-conditioned RL, a policy is given a goal state, and must take actions to reach that state [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. However, as discussed previously, many tasks cannot be specified with a single goal state.…”
Section: Related Work
confidence: 99%
“…In this case, robots are rewarded (i.e. the regret is zero) as they reach a distance of at most δ from the target (see (Dhiman, Banerjee, Siskind, & Corso, 2018) and references therein). In our setting, unlike the target point, the threshold θ n is not known, and the parameter δ is defined with respect to the range of UI or the state space of the MDP.…”
Section: The δ-Regret
confidence: 99%
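The δ-threshold convention in this statement (zero regret once the agent is within distance δ of the target) corresponds to the sparse goal-reaching reward commonly used in such tasks. A minimal sketch, where the Euclidean metric, the per-step penalty of -1, and the default δ are assumptions for illustration:

```python
import numpy as np

def sparse_goal_reward(state, goal, delta=0.05):
    """Sparse goal-reaching reward in the style of the quoted statement:
    the agent is rewarded (regret is zero) once it is within distance
    delta of the target, and penalised by -1 per step otherwise.
    The Euclidean metric and the per-step penalty are assumptions."""
    reached = np.linalg.norm(np.asarray(state) - np.asarray(goal)) <= delta
    return 0.0 if reached else -1.0

print(sparse_goal_reward([0.0, 0.0], [0.03, 0.0]))   # 0.0: within delta
print(sparse_goal_reward([0.0, 0.0], [1.0, 1.0]))    # -1.0: not yet reached
```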