Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1010

Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog

Abstract: Dialog policy decides what and how a task-oriented dialog system will respond, and plays a vital role in delivering effective conversations.

Cited by 60 publications (61 citation statements)
References 27 publications
“…The researchers generally hire human users on a crowdsourcing platform, and human evaluation can be conducted in the following two ways. One is indirect evaluation, in which annotators read the simulated dialog between the dialog system and the user simulator, then rate it [39] or state their preference among different systems [65] according to each metric. The other is direct evaluation, in which participants interact with the system to complete a given task and rate the interaction experience.…”
Section: Human Evaluation
confidence: 99%
“…Instead of estimating the reward signals through annotated labels, Inverse RL (IRL) aims to recover the reward function by observing expert demonstrations. Adversarial learning is often adopted for dialog reward estimation by distinguishing simulated from real user dialogs [64,65,95].…”
Section: User Goal Estimation
confidence: 99%
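The adversarial reward estimation described in the statement above is straightforward to sketch. Below is a minimal, hypothetical PyTorch illustration (not the paper's actual implementation): a discriminator scores (state, action) pairs as real-human versus policy-simulated, and the policy's reward is derived from how "real" the discriminator believes a simulated pair to be. All names (RewardDiscriminator, estimated_reward, discriminator_step) and dimensions are assumptions for illustration.

```python
# A minimal sketch of adversarial (GAIL-style) reward estimation, assuming
# PyTorch and vectorized dialog states/actions. Names and dimensions are
# illustrative, not the paper's actual implementation.
import torch
import torch.nn as nn

class RewardDiscriminator(nn.Module):
    """Scores a (state, action) pair as real-human vs. policy-simulated."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Probability that the pair comes from a real human dialog.
        return torch.sigmoid(self.net(torch.cat([state, action], dim=-1)))

def estimated_reward(disc, state, action, eps=1e-8):
    # A common IRL-style reward: r = log D - log(1 - D); high when the
    # discriminator mistakes the simulated pair for a real one.
    d = disc(state, action)
    return (torch.log(d + eps) - torch.log(1.0 - d + eps)).detach()

def discriminator_step(disc, optimizer, real_s, real_a, sim_s, sim_a):
    # Push D(real) -> 1 and D(simulated) -> 0 with binary cross-entropy.
    bce = nn.BCELoss()
    p_real, p_sim = disc(real_s, real_a), disc(sim_s, sim_a)
    loss = bce(p_real, torch.ones_like(p_real)) + \
           bce(p_sim, torch.zeros_like(p_sim))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The dialog policy is then optimized against estimated_reward while the discriminator keeps adapting; this alternating scheme is what makes the reward estimation adversarial.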
“…apply RL to optimize dialogue systems; in particular, they optimize handcrafted reward signals such as ease of answering, information flow, and semantic coherence. A number of RL methods, including Q-learning (Peng et al., 2017; Lipton et al., 2018; Li et al., 2017a; Su et al., 2018) and policy gradient methods (Dhingra et al., 2016; Williams et al., 2017; Takanobu et al., 2019), have been applied to optimize dialogue policies by interacting with real users or user simulators. With the help of RL, the dialogue agent is able to explore contexts that may not exist in previously observed data.…”
Section: Optimizing Interactive Systems
confidence: 99%
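As a concrete reference point for the policy-gradient family this statement mentions, here is a minimal REINFORCE sketch under assumed names (DialogPolicy, reinforce_update); real systems typically add baselines, batching, and a user simulator in the loop.

```python
# A minimal REINFORCE sketch for dialog policy optimization, assuming
# PyTorch, a discrete dialog-act space, and per-turn rewards from a user
# simulator. All names here are illustrative assumptions.
import torch
import torch.nn as nn

class DialogPolicy(nn.Module):
    def __init__(self, state_dim: int, n_dialog_acts: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_dialog_acts),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # A distribution over dialog acts for the current belief state.
        return torch.distributions.Categorical(logits=self.net(state))

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    # Discounted return G_t for each dialog turn, computed backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalizing returns is a standard variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Ascend expected return == descend on -sum(log_pi(a_t|s_t) * G_t).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```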
“…Monfort et al. (2015) use IRL to predict human motion when interacting with the environment. IRL has also been applied to dialogues to extract the reward function and model the user (Pietquin, 2013; Takanobu et al., 2019; Li et al., 2019, 2020). IRL is used to model user behavior in order to make predictions about it.…”
Section: Rewards For Interactive Systems
confidence: 99%
“…The core of SDS, dialogue management, can be formulated as an RL problem (Levin et al., 1997; Young et al., 2013; Williams, 2008). Great advancements can be achieved with deep RL algorithms (Dhingra et al., 2016; Chang et al., 2017; Takanobu et al., 2019; Wu et al., 2020). Yet, deep RL methods are notoriously expensive in terms of the number of interactions they require.…”
Section: Introduction
confidence: 99%