2018
DOI: 10.48550/arxiv.1809.09318
Preprint

Floyd-Warshall Reinforcement Learning: Learning from Past Experiences to Reach New Goals

Abstract: Consider multi-goal tasks that involve static environments and dynamic goals. Examples of such tasks, such as goal-directed navigation and pick-and-place in robotics, abound. Two types of Reinforcement Learning (RL) algorithms are used for such tasks: model-free or model-based. Each of these approaches has limitations. Model-free RL struggles to transfer learned information when the goal location changes, but achieves high asymptotic accuracy in single goal tasks. Model-based RL can transfer learned information…

Cited by 3 publications (7 citation statements). References 19 publications.
“…However, as was already observed in [11], the FW relaxation requires that the values always over-estimate the optimal costs, and any under-estimation error, due to noise or function approximation, gets propagated through the algorithm without any way of recovery, leading to instability. Indeed, both [11, 6] showed results only for table-lookup value functions, and in our experiments we have found that replacing the STDP update with a FW relaxation (reported in the supplementary) leads to instability when used with function approximation. On the other hand, the complexity of STDP is O(N^3 log N), but the explicit dependence on k in the value function allows for a stable update when using function approximation.…”
Section: Algorithm
confidence: 72%
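The Floyd-Warshall (FW) relaxation this statement refers to is easiest to see in the tabular, goal-conditioned setting. The sketch below is an illustrative reconstruction, not either paper's code: the table `V`, the transition-recording helper, and the unit-cost chain example are assumptions for the example. It also makes the quoted stability argument concrete: each relaxation step only lowers an entry toward the true cost, so the table remains valid only if it over-estimates to begin with; an entry that is too low is propagated into other entries and never pushed back up.

```python
import numpy as np

# Minimal tabular sketch of a Floyd-Warshall-style relaxation for
# goal-conditioned costs (illustrative; names are assumptions, not the
# paper's implementation). V[s, g] upper-bounds the cost of reaching
# goal g from state s and is initialised optimistically high.
N = 5                                  # number of discrete states
INF = 1e9
V = np.full((N, N), INF)
np.fill_diagonal(V, 0.0)

def record_transition(s, s_next, cost):
    """Tighten the bound for a directly observed transition."""
    V[s, s_next] = min(V[s, s_next], cost)

def floyd_warshall_relaxation():
    """Relax every (s, g) pair through every intermediate state w.
    Each update can only decrease V, so the table stays an over-estimate
    as long as it started as one; an under-estimate (e.g. from
    approximation noise) is propagated, never repaired."""
    for w in range(N):
        for s in range(N):
            for g in range(N):
                V[s, g] = min(V[s, g], V[s, w] + V[w, g])

# Example: a 5-state chain with unit step costs.
for s in range(N - 1):
    record_transition(s, s + 1, 1.0)
floyd_warshall_relaxation()
print(V[0, 4])   # 4.0: cost of the composed path 0 -> 1 -> ... -> 4
```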
“…, s_N, and denote by c(s, s') ≥ 0 the weight of edge s → s'. To simplify notation, we replace unconnected edges by edges with weight ∞, creating a complete graph. The APSP problem seeks the shortest paths (i.e., a path with minimum sum of costs) from any start node s to any goal node g in the graph.…”
Section: A Dynamic Programming Principle For Sub-goal Trees
confidence: 99%
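The all-pairs shortest path (APSP) problem as set up in this statement (a complete graph with edge weights c(s, s') and missing edges replaced by weight ∞) is what the classic Floyd-Warshall recursion d(s, g) ← min(d(s, g), d(s, k) + d(k, g)) solves. A minimal sketch, with the cost matrix `c` as an assumed input:

```python
import numpy as np

def all_pairs_shortest_paths(c):
    """Classic Floyd-Warshall on a cost matrix c[i, j] >= 0, with
    unconnected edges encoded as np.inf (the completion used in the
    quoted statement). Returns d with d[s, g] equal to the
    shortest-path cost from s to g."""
    d = c.astype(float).copy()
    np.fill_diagonal(d, 0.0)
    n = d.shape[0]
    for k in range(n):            # allow paths through intermediate node k
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    return d

# Toy example: 3 nodes, missing edges replaced by infinity.
c = np.array([[0.0,    1.0,    np.inf],
              [np.inf, 0.0,    2.0],
              [np.inf, np.inf, 0.0]])
print(all_pairs_shortest_paths(c)[0, 2])   # 3.0, via the middle node
```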
“…In goal-conditioned RL, a policy is given a goal state, and must take actions to reach that state [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. However, as discussed previously, many tasks cannot be specified with a single goal state.…”
Section: Related Work
confidence: 99%
“…In this case, robots are rewarded (i.e. the regret is zero) as they reach a distance of at most δ from the target (see (Dhiman, Banerjee, Siskind, & Corso, 2018) and references therein). In our setting, unlike the target point, the threshold θ n is not known, and the parameter δ is defined with respect to the range of UI or the state space of the MDP.…”
Section: The δ-Regret
confidence: 99%
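The δ-threshold convention in this statement (zero regret once the agent is within distance δ of the target) corresponds to the sparse goal-reaching reward commonly used in such tasks. A minimal sketch, where the Euclidean metric, the per-step penalty of -1, and the default δ are assumptions for illustration:

```python
import numpy as np

def sparse_goal_reward(state, goal, delta=0.05):
    """Sparse goal-reaching reward in the style of the quoted statement:
    the agent is rewarded (regret is zero) once it is within distance
    delta of the target, and penalised by -1 per step otherwise.
    The Euclidean metric and the per-step penalty are assumptions."""
    reached = np.linalg.norm(np.asarray(state) - np.asarray(goal)) <= delta
    return 0.0 if reached else -1.0

print(sparse_goal_reward([0.0, 0.0], [0.03, 0.0]))   # 0.0: within delta
print(sparse_goal_reward([0.0, 0.0], [1.0, 1.0]))    # -1.0: not yet reached
```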