Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B parameter GPT-3, despite InstructGPT having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
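The ranking stage described above is typically operationalized by fitting a scalar reward model with a pairwise comparison loss and then optimizing the policy against that model with RL. Below is a minimal PyTorch sketch of the pairwise loss, assuming the reward model has already mapped each completion to a scalar score; the function and variable names are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of a pairwise (Bradley-Terry style) reward-model loss,
# as used in RLHF pipelines like the one described above. Names and
# shapes are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_preferred: torch.Tensor,
                        r_rejected: torch.Tensor) -> torch.Tensor:
    """Push the scalar reward of the human-preferred completion
    above the reward of the rejected completion."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Toy usage: scalar rewards for a batch of 4 ranked completion pairs.
r_a = torch.randn(4, requires_grad=True)  # rewards for preferred outputs
r_b = torch.randn(4)                      # rewards for rejected outputs
loss = reward_ranking_loss(r_a, r_b)
loss.backward()  # gradients flow into the reward model's parameters
```

The trained reward model then supplies the scalar reward that the RL stage (PPO in the paper) maximizes over the prompt distribution.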
The purpose of this technical report is two-fold. First, it introduces a suite of challenging continuous control tasks (integrated with OpenAI Gym) based on existing robotics hardware. The tasks include pushing, sliding, and pick-and-place with a Fetch robotic arm, as well as in-hand object manipulation with a Shadow Dexterous Hand. All tasks have sparse binary rewards and follow a Multi-Goal Reinforcement Learning (RL) framework in which an agent is told what to do using an additional goal input. The second part of the report presents a set of concrete research ideas for improving RL algorithms, most of which relate to Multi-Goal RL and Hindsight Experience Replay.
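Concretely, the tasks follow Gym's multi-goal interface: observations are dictionaries carrying the state, the goal actually achieved, and the goal the agent was asked to achieve, and a compute_reward method exposes the sparse binary reward so it can be recomputed for substituted goals. The short sketch below illustrates this interface; the environment ID, version suffix, and reset signature depend on the installed gym release, so treat them as assumptions.

```python
# Sketch of the multi-goal Gym interface the report describes.
# The environment ID/version is an assumption and varies across
# gym / gym-robotics releases; MuJoCo must be installed.
import gym

env = gym.make("FetchPickAndPlace-v1")
obs = env.reset()

# Multi-goal observations are dicts: the raw state, the goal the agent
# actually achieved, and the goal it was asked to achieve.
print(obs["observation"].shape, obs["achieved_goal"], obs["desired_goal"])

# Sparse binary reward: 0 when the achieved goal matches the desired
# goal within a tolerance, -1 otherwise. Exposing compute_reward lets
# replay strategies recompute rewards for substituted goals.
reward = env.compute_reward(obs["achieved_goal"], obs["desired_goal"], {})
```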
Adoptive immunotherapy using chimeric antigen receptor–modified T cells (CAR-T) has made substantial contributions to the treatment of certain B cell malignancies. Such treatment modalities could potentially obviate the need for long-term antiretroviral drug therapy in HIV/AIDS. Here, we report the development of HIV-1–based lentiviral vectors that encode CARs targeting multiple highly conserved sites on the HIV-1 envelope glycoprotein using a two-molecule CAR architecture, termed duoCAR. We show that transduction with lentiviral vectors encoding multispecific anti-HIV duoCARs confers primary T cells with the capacity to potently reduce cellular HIV infection by up to 99% in vitro and >97% in vivo. Although T cells are themselves the targets of HIV infection, the transduced T cells were protected from genetically diverse HIV-1 strains. The CAR-T cells also potently eliminated PBMCs infected with broadly neutralizing antibody-resistant HIV strains, including VRC01/3BNC117-resistant HIV-1. Furthermore, multispecific anti-HIV duoCAR-T cells demonstrated long-term control of HIV infection in vivo and prevented the loss of CD4+ T cells during HIV infection in a humanized NSG mouse model of intrasplenic HIV infection. These data suggest that multispecific anti-HIV duoCAR-T cells could be an effective approach for the treatment of patients with HIV-1 infection.
Dealing with sparse rewards is one of the biggest challenges in Reinforcement Learning (RL). We present a novel technique called Hindsight Experience Replay that allows sample-efficient learning from rewards which are sparse and binary, thereby avoiding the need for complicated reward engineering. It can be combined with an arbitrary off-policy RL algorithm and may be seen as a form of implicit curriculum. We demonstrate our approach on the task of manipulating objects with a robotic arm. In particular, we run experiments on three different tasks: pushing, sliding, and pick-and-place, in each case using only binary rewards indicating whether or not the task is completed. Our ablation studies show that Hindsight Experience Replay is a crucial ingredient that makes training possible in these challenging environments. We show that policies trained in a physics simulation can be deployed on a physical robot and successfully complete the task. The video presenting our experiments is available at https://goo.gl/SMrQnI.
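The core idea is that each transition is replayed not only with the goal the agent was originally given but also with a goal it actually achieved later in the episode, so sparse binary rewards become informative in hindsight. Below is a minimal sketch of the simplest "final" relabeling strategy on top of the multi-goal interface above; the transition keys and the plain-list buffer are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the "final" hindsight relabeling strategy.
# Each transition in `episode` is assumed to be a dict with keys
# obs, action, reward, next_obs, desired_goal, next_achieved_goal;
# these names and the list-based buffer are illustrative assumptions.
def relabel_with_final_goal(episode, compute_reward, replay_buffer):
    """Append each transition twice: once with its original goal, and
    once with the goal actually achieved at the end of the episode."""
    final_goal = episode[-1]["next_achieved_goal"]
    for t in episode:
        # Original transition; under sparse rewards this is usually -1.
        replay_buffer.append(
            (t["obs"], t["action"], t["reward"],
             t["next_obs"], t["desired_goal"]))
        # Hindsight transition: pretend the final achieved goal had been
        # the target all along, and recompute its binary reward (often 0,
        # i.e. "success", which is what makes learning possible).
        hindsight_reward = compute_reward(
            t["next_achieved_goal"], final_goal, {})
        replay_buffer.append(
            (t["obs"], t["action"], hindsight_reward,
             t["next_obs"], final_goal))
```

Because the relabeled transitions are consumed by an off-policy learner (e.g. DQN or DDPG), changing the goal after the fact does not invalidate them, which is why HER composes with any off-policy RL algorithm.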