However, while these approaches have shown significant success in a number of domains [7,10,9,32], learning from purely offline data leads to a mismatch between the training and execution trajectory distributions, which yields suboptimal performance both in theory and in practice [12,13]. To address this problem, a number of approaches have been proposed that solicit online human feedback while the agent acts in the environment, such as suggested actions [12,35,36,17] or preferences [37,38,39,40,41,42]. However, many of these forms of human feedback may be unreliable if the robot visits states that differ significantly from those the human supervisor would themselves visit; in such situations, it is challenging for the supervisor to determine what correct behavior should look like without directly interacting with the environment [16,43].