2022
DOI: 10.48550/arxiv.2206.11871
Preprint

Offline RL for Natural Language Generation with Implicit Language Q Learning

Cited by 7 publications (15 citation statements)
References 0 publications
“…Reinforcement Learning from Feedback: RL has been applied to enhance various models in NLP tasks such as machine translation [117], summarization [18], dialogue generation [118], image captioning [119], question generation [120], text-games [121], and more [122,123,124]. RL is a helpful method for optimizing non-differentiable objectives in language generation tasks by treating them as sequential decision-making problems.…”
Section: Instruction-aligning Methods (mentioning, confidence: 99%)
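To make the "sequential decision-making" framing concrete, here is a minimal toy sketch, not drawn from any of the cited works: a categorical policy over a tiny vocabulary emits one token per timestep and is updated with REINFORCE, so the sequence-level reward can be any non-differentiable score. The vocabulary, reward function, and learning rate below are illustrative assumptions.

```python
# Toy sketch (assumptions, not a cited method): text generation as a
# sequential decision problem, optimized with REINFORCE so the reward
# need not be differentiable.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<eos>", "hello", "world", "foo"]
SEQ_LEN = 4
logits = np.zeros((SEQ_LEN, len(VOCAB)))  # toy tabular policy: one logit row per timestep

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(tokens):
    # Non-differentiable, sequence-level objective: count of target tokens produced.
    return float(sum(t in {"hello", "world"} for t in tokens))

for _ in range(2000):
    probs = [softmax(row) for row in logits]                 # pi(a_t | t)
    actions = [rng.choice(len(VOCAB), p=p) for p in probs]   # sample one token per timestep
    R = reward([VOCAB[a] for a in actions])
    # REINFORCE: scale grad log pi(a_t) by the sequence reward; the reward itself
    # is never differentiated.
    for t, (a, p) in enumerate(zip(actions, probs)):
        grad_logp = -p
        grad_logp[a] += 1.0
        logits[t] += 0.1 * R * grad_logp

print([VOCAB[int(np.argmax(row))] for row in logits])  # typically trends toward "hello"/"world"
```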
“…On the other hand, offline RL (Fujimoto et al., 2019; Kumar et al., 2020; Brandfonbrener et al., 2021; Kostrikov et al., 2021) removes all need for environment interaction or user simulators, instead operating purely on static datasets of prior human interaction. There are many closely related works (Jaques et al., 2019, 2020; Snell et al., 2022; Cohen et al., 2022; Verma et al., 2022; Jang et al., 2022) based on offline RL that perform policy improvement via behavior cloning of self-generated utterances, which inherits the ability of pre-trained language models to generate human-like responses. In RL parlance, such methods could be considered policy extraction with approximate dynamic programming.…”
Section: Related Work (mentioning, confidence: 99%)
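The "policy improvement via behavior cloning" idea can be illustrated with a short, hedged sketch of advantage-weighted behavior cloning on a static batch; the network, batch contents, temperature, and advantage estimates below are placeholder assumptions, not the cited papers' implementations.

```python
# Hedged sketch: policy extraction from a static (offline) batch by cloning
# logged tokens, weighted by exponentiated advantages from some critic
# (here random placeholders), in the spirit of advantage-weighted regression.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, HIDDEN, BATCH = 100, 32, 16

policy = torch.nn.Sequential(torch.nn.Linear(HIDDEN, HIDDEN), torch.nn.ReLU(),
                             torch.nn.Linear(HIDDEN, VOCAB))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Placeholder offline batch: state features, logged next tokens, and advantages
# A(s, a) = Q(s, a) - V(s) that would come from a separately trained critic.
states = torch.randn(BATCH, HIDDEN)
actions = torch.randint(0, VOCAB, (BATCH,))
advantages = torch.randn(BATCH)

logp = F.log_softmax(policy(states), dim=-1)            # log pi(a | s)
logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
weights = torch.exp(advantages / 1.0).clamp(max=20.0)   # temperature 1.0, clipped for stability
loss = -(weights.detach() * logp_a).mean()              # weighted behavior cloning objective

opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```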
“…Drawing inspiration from the work of [32], we formalize our problem as a Partially Observable Markov Decision Process [33]. At each timestep t, the agent (i.e.…”
Section: RL Task Formulation (mentioning, confidence: 99%)
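As a rough illustration of such a POMDP formulation (an assumption-laden toy, not the cited paper's actual environment), the agent below observes only the dialogue history while the user's goal stays hidden, acts one token per timestep, and receives a sparse terminal reward.

```python
# Toy POMDP-style dialogue loop (illustrative assumptions only): the latent
# user goal is part of the state but never observed; the agent sees just the
# textual history and emits one token per timestep.
from dataclasses import dataclass, field

@dataclass
class DialoguePOMDP:
    hidden_user_goal: str                          # latent state, never shown to the agent
    history: list = field(default_factory=list)    # observable dialogue so far

    def observe(self):
        return list(self.history)                  # observation excludes the hidden goal

    def step(self, action_token: str):
        self.history.append(action_token)
        done = action_token == "<eos>"
        # Sparse terminal reward: did the produced text mention the hidden goal?
        reward = float(done and self.hidden_user_goal in " ".join(self.history))
        return self.observe(), reward, done

env = DialoguePOMDP(hidden_user_goal="book a table")
for tok in ["book", "a table", "<eos>"]:
    obs, r, done = env.step(tok)
print(obs, r, done)
```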