Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021
DOI: 10.18653/v1/2021.emnlp-main.589

Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

Abstract: Reliable automatic evaluation of dialogue systems under an interactive environment has long been overdue. An ideal environment for evaluating dialog systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large scale experiments. Though researchers have attempted to use metrics for language generation tasks (e.g., perplexity, BLEU) or some model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods only …

Cited by 3 publications (1 citation statement)
References 54 publications
“…An on-policy algorithm is based on a single policy, denoted as π, which is utilized by an agent to take actions in a given state s, aiming to obtain a reward [48]. In contrast, off-policy algorithms involve the use of two policies, the target policy and the behavior policy, denoted as π and μ, respectively [49, 50]. The target policy is the one that the agent seeks to learn and optimize, while the behavior policy generates the observations that are used for learning.…”
Section: Reinforcement Learning (RL)
Citation type: mentioning
confidence: 99%
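
The distinction drawn in the citation statement above can be illustrated with a small sketch. It estimates the value of a target policy π in two ways: on-policy, from data generated by π itself, and off-policy, from data generated by a behavior policy μ and reweighted by π(a)/μ(a) (ordinary importance sampling). This is a generic textbook estimator chosen for illustration, not the method of the cited paper. The single-state problem, the reward table, and all names (`REWARDS`, `TARGET`, `BEHAVIOR`, `collect`, and so on) are hypothetical assumptions.

```python
import random

# Hypothetical single-state problem with two actions, for illustration only.
REWARDS = {0: 1.0, 1: 0.0}    # deterministic reward for each action
TARGET = {0: 0.9, 1: 0.1}     # target policy pi(a): the policy to evaluate
BEHAVIOR = {0: 0.5, 1: 0.5}   # behavior policy mu(a): the policy that logs data


def sample_action(policy, rng):
    """Draw one action from a dict mapping action -> probability."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def collect(policy, n, rng):
    """Log n (action, reward) pairs by acting with the given policy."""
    data = []
    for _ in range(n):
        action = sample_action(policy, rng)
        data.append((action, REWARDS[action]))
    return data


def on_policy_value(data):
    """Average reward of data that was generated by the target policy itself."""
    return sum(reward for _, reward in data) / len(data)


def off_policy_value(data, target, behavior):
    """Ordinary importance sampling: reweight each reward by pi(a) / mu(a)."""
    weighted = [target[a] / behavior[a] * reward for a, reward in data]
    return sum(weighted) / len(weighted)


rng = random.Random(0)
true_value = sum(TARGET[a] * REWARDS[a] for a in TARGET)
on_policy_estimate = on_policy_value(collect(TARGET, 20000, rng))
off_policy_estimate = off_policy_value(
    collect(BEHAVIOR, 20000, rng), TARGET, BEHAVIOR
)
```

Both estimates should land close to the true value of π in this toy setting. The off-policy estimate never executes π, which is why off-policy evaluation is attractive when running the target policy against real users is expensive. The reweighting requires μ to give nonzero probability to every action that π can take.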