Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.8

[CASPI] Causal-aware Safe Policy Improvement for Task-oriented Dialogue

Abstract: The recent success of reinforcement learning (RL) in solving complex tasks is often attributed to its capacity to explore and exploit an environment. Sample efficiency is usually not an issue for tasks with cheap simulators to sample data online. On the other hand, Task-oriented Dialogues (ToD) are usually learnt from offline data collected using human demonstrations. Collecting diverse demonstrations and annotating them is expensive. Unfortunately, RL policies trained on off-policy data are prone to issues of bi…
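The abstract describes learning ToD policies purely from offline human demonstrations, where off-policy training without a simulator invites bias; a common guard in this setting is to regularize the learned policy toward the demonstrated behavior. The sketch below is a minimal, hypothetical PyTorch illustration of that general idea, not the paper's implementation; PolicyNet, the advantage estimates, and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Maps a dialogue-state feature vector to logits over system acts (illustrative)."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # unnormalized logits

def offline_policy_loss(policy, states, actions, advantages, beta=0.1):
    """Advantage-weighted policy update plus a behavior-cloning term.

    The BC term keeps the learned policy close to the demonstrations,
    which is the usual guard against off-policy bias when no simulator
    is available for online rollouts. beta is an assumed trade-off weight.
    """
    logits = policy(states)
    logp = F.log_softmax(logits, dim=-1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    rl_term = -(advantages.detach() * logp_a).mean()  # exploit estimated advantages
    bc_term = F.cross_entropy(logits, actions)        # stay near the behavior policy
    return rl_term + beta * bc_term
```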


Cited by 6 publications (3 citation statements)
References 10 publications
“…As a result, KRLS does not perform exploration/planning on a dialog-level, which can be very useful for tasks that require long-horizon planning to be successful (e.g., persuading a person to donate to a charity). We believe one way to extend KRLS could be using a GPT-like model to learn from an entire dialog, and combine with safe policy improvement methods to avoid potentially large bias and poor sample efficiency during dialog-level RL learning (Ramachandran et al., 2022). We leave this for future work.…”
Section: Limitations
confidence: 99%
“…For appropriateness and fluency, we followed the definitions from prior work (Zhang et al., 2020b; Ramachandran et al., 2021; Jang et al., 2022; Feng et al., 2023).…”
confidence: 99%
“…The method is however only demonstrated on a very small state and action space, and it is therefore unclear whether it generalizes to more complex setups. Ramachandran et al. (2021) applied offline RL with a pair-wise reward learning model based on preference learning; however, it still utilizes the corpus-based success rate for choosing the preferred rollout. To the best of our knowledge, offline RL has not previously been deployed for dialogue evaluation.…”
Section: Related Work
confidence: 99%
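The last citation statement characterizes this work as offline RL with a pair-wise reward learning model based on preference learning, where the preferred rollout is chosen via a corpus-based success rate. As a rough, hypothetical illustration of pairwise reward learning in general (not the authors' code), the PyTorch sketch below fits a Bradley-Terry-style scorer so that preferred rollout features score higher than rejected ones; RewardNet and the feature inputs are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Scores a rollout summary vector; higher means more preferred (illustrative)."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def pairwise_reward_loss(reward_net, preferred, rejected):
    """Bradley-Terry-style loss: the preferred rollout should outscore the rejected one.

    In the setting described above, the preference label could come from a
    corpus-based success signal; here it is simply assumed to be given.
    """
    r_pos = reward_net(preferred)
    r_neg = reward_net(rejected)
    return -F.logsigmoid(r_pos - r_neg).mean()
```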