“…The Q-function, Q_i(s, a), of Task i estimates the expected discounted return of the policy after taking action a at state s (Watkins & Dayan, 1992). Although this is an estimate acquired during training, it is a critical component in many state-of-the-art RL algorithms (Haarnoja et al., 2018; Lillicrap et al., 2015) and has been used to filter for high-quality data in multi-task (Yu et al., 2021) and imitation learning settings (Nair et al., 2018; Sasaki & Yamashina, 2020), which suggests that the Q-function is effective for evaluating and comparing actions even while training is ongoing. Unlike in single-task RL, we use the Q-function as a switch that rates action proposals from other tasks' policies at the current task's state s. This simple and intuitive function is state- and task-dependent, gives the current best estimate of which behaviors are most helpful, and adapts quickly to changes in its own and other policies during online learning.…”
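The excerpt describes the Q-function acting as a switch over candidate actions proposed by all tasks' policies. The following is a minimal sketch of that selection step under stated assumptions; the names `policies`, `q_i`, and `select_action` are illustrative, not the authors' implementation.

```python
# Sketch: use Task i's Q-function to rate action proposals from every task's
# policy at the current state, and act with the highest-rated proposal.
# All names here are hypothetical placeholders for the mechanism described above.
import numpy as np

def select_action(state, policies, q_i):
    """Pick the proposal that Task i's Q-function rates highest.

    policies: list of callables, policies[j](state) -> action proposed by Task j's policy
    q_i:      callable, q_i(state, action) -> Task i's estimated discounted return
    """
    proposals = [pi(state) for pi in policies]      # one candidate action per task
    scores = [q_i(state, a) for a in proposals]     # rate each candidate with Task i's Q
    return proposals[int(np.argmax(scores))]        # behave with the best-rated action
```

Because the switch only reads Q_i(s, a) at the current state, it requires no extra learned components and automatically tracks the latest critic estimates as all policies continue to update online.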