2019
DOI: 10.1609/aaai.v33i01.33017289
Switch-Based Active Deep Dyna-Q: Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning

Abstract: Training task-completion dialogue agents with reinforcement learning usually requires a large number of real user experiences. The Dyna-Q algorithm extends Q-learning by integrating a world model, and thus can effectively boost training efficiency using simulated experiences generated by the world model. The effectiveness of Dyna-Q, however, depends on the quality of the world model, or, implicitly, the pre-specified ratio of real vs. simulated experiences used for Q-learning. To this end, we extend the recently…
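The trade-off the abstract describes, learning from real user experience while also planning on simulated experience from a world model, can be made concrete with a minimal Dyna-Q-style sketch. Everything below (the ToyWorldModel class, dyna_q_step, and the planning_steps parameter that plays the role of the pre-specified real-vs-simulated ratio) is an illustrative assumption, not the paper's implementation.

```python
import random

class ToyWorldModel:
    """Toy learned model of the environment: memorizes and replays real transitions."""
    def __init__(self):
        self.memory = []                       # (state, action, reward, next_state) tuples

    def update(self, transition):
        self.memory.append(transition)         # "train" the model on a real experience

    def simulate(self):
        return random.choice(self.memory)      # generate a simulated experience

def dyna_q_step(q, real_transition, world_model, planning_steps,
                actions=(0, 1), alpha=0.1, gamma=0.9):
    """One Dyna-Q iteration: direct RL on one real transition plus K planning updates."""
    def td_update(s, a, r, s_next):
        best_next = max(q.get((s_next, b), 0.0) for b in actions)
        q[(s, a)] = q.get((s, a), 0.0) + alpha * (r + gamma * best_next - q.get((s, a), 0.0))

    td_update(*real_transition)                # learn from the real user experience
    world_model.update(real_transition)        # refine the world model
    for _ in range(planning_steps):            # planning_steps fixes the real/simulated ratio
        td_update(*world_model.simulate())     # learn from simulated experience
```

In this sketch, setting planning_steps to 0 recovers plain Q-learning on real experience only, while larger values lean more heavily on the world model, which is exactly the ratio the abstract says is usually pre-specified.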

Cited by 40 publications (41 citation statements). References 14 publications.
“…However, the user simulator is not able to fully mimic real human conversation behaviors, and its inductive bias may lead to sub-optimal models that perform poorly in real human conversation. To alleviate these problems, model-based RL methods are proposed to model the environment, enabling planning for dialog policy learning [40][41][42]. In model-based RL approaches, the environment is modeled to simulate the dynamics of the conversation.…”
Section: Dialog Policy (mentioning)
confidence: 99%
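The world model these model-based RL approaches refer to is, in the Deep Dyna-Q line of work, a learned simulator of the conversation's dynamics: given the current dialogue state and the agent's action, it predicts the user's response, a reward, and whether the dialogue terminates. The small network below is only an illustrative sketch under that assumption; the layer sizes and names (DialogueWorldModel, user_action_head, etc.) are hypothetical, not any paper's exact architecture.

```python
import torch
import torch.nn as nn

class DialogueWorldModel(nn.Module):
    """Predicts the simulated user's next action, a turn reward, and a termination flag."""
    def __init__(self, state_dim, action_dim, user_action_dim, hidden_dim=80):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim + action_dim, hidden_dim), nn.Tanh())
        self.user_action_head = nn.Linear(hidden_dim, user_action_dim)  # next user action logits
        self.reward_head = nn.Linear(hidden_dim, 1)                     # predicted reward
        self.terminal_head = nn.Linear(hidden_dim, 1)                   # P(dialogue ends)

    def forward(self, state, agent_action):
        h = self.shared(torch.cat([state, agent_action], dim=-1))
        return (self.user_action_head(h),
                self.reward_head(h),
                torch.sigmoid(self.terminal_head(h)))
```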
“…The simulated experiences that the discriminator failed to distinguish, or had difficulty distinguishing, from real ones were then used in the policy-learning phase of the VA. In [30], the authors presented yet another variant of the Deep Dyna-Q framework [28], called Switch-based Active Deep Dyna-Q, to counter the problem of low-quality simulated user experiences from the world model and the limited sample efficiency of the Dyna-Q framework. They incorporated a switcher and an active sampling strategy to determine when to use real or simulated user experience in different phases of dialogue policy training, and to generate those simulated user experiences that have not been fully explored by the VA.…”
Section: PLOS ONE (mentioning)
confidence: 99%
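A hedged sketch of the two ideas attributed to Switch-DDQ in the statement above: a switcher that decides whether the next batch of experience should come from real users or from the world model, and an active sampling heuristic that favors user goals the agent has explored least. The quality estimate, threshold, and count-based weighting below are illustrative assumptions, not the authors' exact criteria.

```python
import math
import random

def choose_experience_source(model_quality_estimate, threshold=0.5):
    """Switcher: rely on simulated experience only when the world model looks trustworthy."""
    return "simulated" if model_quality_estimate >= threshold else "real"

def sample_under_explored_goal(visit_counts):
    """Active sampling heuristic: prefer user goals the agent has explored least."""
    goals = list(visit_counts)
    weights = [1.0 / math.sqrt(1.0 + visit_counts[g]) for g in goals]
    return random.choices(goals, weights=weights, k=1)[0]

# Early in training the world model is unreliable, so real experience is preferred.
print(choose_experience_source(model_quality_estimate=0.2))                # -> real
print(sample_under_explored_goal({"book_movie": 57, "request_ticket": 3}))  # likely the rarer goal
```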
“…Reference [32] proposed a new model-based reinforcement learning approach, discriminative deep dyna-Q (D3Q), for task-completion dialogue policy learning. Reference [33] presented a new reinforcement learning framework, switch-based active deep dyna-Q (Switch-DDQ), for task-completion dialogue policy learning.…”
Section: Generative Adversarial Nets (mentioning)
confidence: 99%