The Markov game framework is widely used to model interactions among agents with heterogeneous utilities in dynamic, uncertain, societal-scale systems. In these settings, agents typically operate in a decentralized manner due to privacy and scalability concerns, often without knowledge of others' strategies. Designing decentralized learning algorithms that provably converge to rational outcomes remains challenging, especially beyond Markov zero-sum and potential games, which do not fully capture the mixed cooperative-competitive nature of real-world interactions. This paper focuses on designing decentralized learning algorithms for general-sum Markov games, aiming to provide guarantees of convergence to approximate Nash equilibria. We introduce the Markov Near-Potential Function (MNPF) and show that it plays a central role in analyzing the convergence of actor-critic-based decentralized learning dynamics to approximate Nash equilibria. Our analysis leverages the two-timescale nature of actor-critic algorithms, in which Q-function updates occur on a faster timescale than policy updates. This result is further strengthened under certain regularity conditions and when the set of Nash equilibria is finite. Our findings offer a new perspective on the analysis of decentralized learning in multi-agent systems, addressing the complexities of real-world interactions.
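As a point of reference, the near-potential idea can be sketched as follows; the notation below is illustrative and is not taken verbatim from the paper. An MNPF is a function over joint policies whose change under any unilateral deviation tracks the deviating agent's change in value up to a uniform gap, and a zero gap recovers an exact Markov potential game.

```latex
% Illustrative sketch of a Markov near-potential condition (assumed notation).
% \Phi is the candidate potential, V_i is agent i's discounted value, and
% \kappa \ge 0 is the near-potential gap; \kappa = 0 corresponds to an exact
% Markov potential game.
\[
\bigl|\,\Phi(\pi_i',\pi_{-i}) - \Phi(\pi_i,\pi_{-i})
  - \bigl(V_i(\pi_i',\pi_{-i}) - V_i(\pi_i,\pi_{-i})\bigr)\,\bigr| \;\le\; \kappa,
\qquad \forall\, i,\ \pi_i,\ \pi_i',\ \pi_{-i}.
\]
```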
We propose a multi-agent reinforcement learning dynamics and analyze its convergence properties in infinite-horizon discounted Markov potential games. We focus on the independent and decentralized setting, where players observe only the realized state and their own reward in every stage. Players have no knowledge of the game model and cannot coordinate with each other. In each stage of our learning dynamics, players asynchronously update their estimates of a perturbed Q-function, which evaluates their total contingent payoff, based on the realized one-stage reward. Then, players independently update their policies by incorporating a smoothed optimal one-stage deviation strategy based on the estimated Q-function. A key feature of the learning dynamics is that the Q-function estimates are updated on a faster timescale than the policies. We prove that the policies induced by our learning dynamics converge to a stationary Nash equilibrium in Markov potential games with probability 1. Our results build on the theory of two-timescale asynchronous stochastic approximation and a new analysis of the monotonicity of the potential function along the trajectory of policy updates in Markov potential games.
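The following is a minimal sketch, under stated assumptions, of a two-timescale dynamics of this flavor: each player asynchronously updates only the visited entry of its local Q-estimate from its realized one-stage reward, then nudges its policy toward a smoothed (softmax) one-stage best response, with the Q-step size decaying more slowly than the policy step size. The environment interface (`env.reset`, `env.step`), the step-size schedules, and the temperature `tau` are illustrative assumptions, not the paper's exact specification.

```python
# Illustrative sketch of independent two-timescale learning dynamics.
# Assumed interface: env.reset() -> state index; env.step(actions) -> (next_state, rewards).
import numpy as np

def softmax(q, tau):
    """Smoothed (entropy-regularized) one-stage best response."""
    z = (q - q.max()) / tau
    e = np.exp(z)
    return e / e.sum()

def independent_learning(env, n_players, n_states, n_actions,
                         gamma=0.95, tau=0.1, T=200_000):
    Q = np.zeros((n_players, n_states, n_actions))                    # local Q-estimates
    pi = np.full((n_players, n_states, n_actions), 1.0 / n_actions)   # local policies

    s = env.reset()
    for t in range(1, T + 1):
        alpha = 1.0 / t ** 0.6   # fast timescale: Q-function estimates
        beta = 1.0 / t ** 0.9    # slow timescale: policies (beta / alpha -> 0)

        # Each player acts independently; only its own reward is observed.
        a = [np.random.choice(n_actions, p=pi[i, s]) for i in range(n_players)]
        s_next, r = env.step(a)          # r[i]: player i's realized one-stage reward

        for i in range(n_players):
            # Asynchronous update: only the visited (state, action) entry changes.
            cont = pi[i, s_next] @ Q[i, s_next]      # expected continuation value
            Q[i, s, a[i]] += alpha * (r[i] + gamma * cont - Q[i, s, a[i]])

            # Policy step toward the smoothed optimal one-stage deviation.
            pi[i, s] += beta * (softmax(Q[i, s], tau) - pi[i, s])

        s = s_next
    return pi
```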