Many real-world applications, such as those in medical domains and recommendation systems, can be formulated as reinforcement learning problems over large state spaces with only a small budget for the number of policy changes, i.e., low switching cost. This paper focuses on the linear Markov Decision Process (MDP) recently studied in Yang and Wang [2019a] and Jin et al. [2019], where linear function approximation is used for generalization over the large state space. We present the first algorithm for linear MDPs with low switching cost. Our algorithm achieves an $\widetilde{O}(\sqrt{d^3 H^4 K})$ regret bound with a near-optimal $O(dH \log K)$ global switching cost, where $d$ is the feature dimension, $H$ is the planning horizon, and $K$ is the number of episodes the agent plays. Our regret bound matches the best existing polynomial-time algorithm by Jin et al. [2019], and our switching cost is exponentially smaller than theirs. When specialized to tabular MDPs, our switching cost bound improves those in Bai et al. [2019] and Zhang et al. [2020b]. We complement our positive result with an $\Omega(dH / \log d)$ global switching cost lower bound for any no-regret algorithm.
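For concreteness, the global switching cost referenced above is commonly defined as the number of episodes in which the deployed policy changes; stated here as an assumed standard definition (following the notion used in Bai et al. [2019]) rather than quoted from this paper:

$$ N_{\mathrm{switch}} := \sum_{k=1}^{K-1} \mathbb{1}\{\pi_{k+1} \neq \pi_k\}, $$

where $\pi_k$ denotes the policy executed in episode $k$. The result above bounds this quantity by $O(dH \log K)$ while retaining an $\widetilde{O}(\sqrt{d^3 H^4 K})$ regret.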