“…The reward-free RL setting seeks to explore an MDP without access to a reward function, in order to then determine a near-optimal policy for an arbitrary reward function. This has been studied in the tabular setting (Jin et al., 2020a; Ménard et al., 2020; Zhang et al., 2020a; Wu et al., 2021), where the optimal scaling is known to be Θ(|S|²|A|/ε²) (Jin et al., 2020a), as well as in the function approximation setting (Zanette et al., 2020c; Wang et al., 2020; Zhang et al., 2021a). In the linear MDP setting, Wang et al. (2020) show a sample complexity of Õ(d³H⁶/ε²), and Zanette et al. (2020c) show a complexity of…”