Causal Reinforcement Learning (CRL) is an emerging field that integrates two areas essential to the development of artificial intelligence. Existing work in the area has shown how causality can help mitigate some of the limitations of reinforcement learning (RL), such as data inefficiency, lack of interpretability, and long learning times, among others. However, how reinforcement learning can be used to support causal discovery (CD) has so far been less explored. In this article, we introduce CARL, a Causality-Aware Reinforcement Learning framework for simultaneously learning and using causal models to speed up policy learning in online Markov decision process (MDP) settings. In a synergistic way, our method alternates between: (i) RL for CD, where it promotes the selection of actions that yield better causal models in fewer episodes than traditional data-collection strategies in RL; (ii) CD, where a score-based algorithm is used to learn causal models; and (iii) RL using CD, where the learned models are used to select actions that speed up learning of the optimal policy by reducing the number of interactions with the environment. Experiments in simulated environments show that our method achieves better policy-learning results than traditional model-free and model-based algorithms while also learning the underlying causal models. We also show that the learned causal models can be transferred directly to a similar task of greater complexity, significantly reducing the number of episodes needed to learn an optimal policy. Finally, we verify the method's scalability to high-dimensional states, where the action-value function must be represented with deep neural networks.
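To make the alternation among the three phases concrete, the following is a minimal sketch of the training loop it implies. This is only an illustrative outline under our own assumptions: the environment interface (`env.reset`, `env.step`, `env.actions`) and all helper functions are hypothetical placeholders, not the CARL implementation or its API.

```python
import random

def select_informative_action(env, state, transitions):
    # (i) RL for CD: placeholder for choosing actions expected to be
    # informative for structure learning; here it samples uniformly.
    return random.choice(env.actions)

def score_based_causal_discovery(transitions):
    # (ii) CD: placeholder for a score-based structure-learning step
    # (e.g., greedy search over graphs maximizing a fit score).
    return {"graph": None, "fitted_on": len(transitions)}

def select_action_with_model(env, state, causal_model):
    # (iii) RL using CD: placeholder for planning with the learned
    # causal model; here it again samples uniformly.
    return random.choice(env.actions)

def carl_loop(env, num_episodes, refit_every=10):
    """Alternate between acting to support causal discovery, fitting a
    causal model, and acting with the model to speed up policy learning."""
    transitions, causal_model = [], None
    for episode in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            if causal_model is None:
                action = select_informative_action(env, state, transitions)
            else:
                action = select_action_with_model(env, state, causal_model)
            next_state, reward, done = env.step(action)
            transitions.append((state, action, next_state, reward))
            state = next_state
        # Periodically refit the causal model from the gathered transitions.
        if (episode + 1) % refit_every == 0:
            causal_model = score_based_causal_discovery(transitions)
    return causal_model
```

In this sketch, the agent acts to gather structure-relevant data until a causal model is available, after which the model guides action selection, mirroring the interplay between the three phases described above.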