We consider a subclass of n-player stochastic games in which players have their own internal state/action spaces but are coupled through their payoff functions. It is assumed that players' internal chains are driven by independent transition probabilities. Moreover, players can only receive realizations of their payoffs, not the payoff functions themselves, nor can they observe each other's states/actions. Under some assumptions on the structure of the payoff functions, we develop efficient learning algorithms based on Dual Averaging and Dual Mirror Descent, which provably converge, almost surely or in expectation, to the set of ε-Nash equilibrium policies. In particular, we derive upper bounds on the number of iterations required to reach an ε-Nash equilibrium policy that scale polynomially in the game parameters. Besides Markov potential games and linear-quadratic stochastic games, this work provides another interesting subclass of n-player stochastic games that provably admits polynomial-time learning algorithms for finding ε-Nash equilibrium policies.
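To illustrate the type of update a Dual Averaging scheme is built on, the following is a minimal sketch of a dual-averaging policy update with an entropic regularizer over a single player's action simplex, driven by noisy payoff feedback. The function names, step-size schedule, and payoff-gradient oracle are illustrative assumptions and do not reproduce the algorithm analyzed in this paper.

```python
import numpy as np

# Hypothetical sketch: dual averaging ("lazy" mirror descent with an entropic
# regularizer) for one player's mixed policy over a finite action set.
# The payoff-gradient oracle stands in for realized payoff feedback.

def softmax(z):
    z = z - z.max()                      # stabilize the exponentials
    w = np.exp(z)
    return w / w.sum()

def dual_averaging_policy(payoff_gradient, n_actions, horizon=1000, eta=0.1):
    """Return the policy after `horizon` dual-averaging rounds.

    payoff_gradient(policy) is assumed to return a (possibly noisy)
    estimate of the player's payoff gradient at the current policy.
    """
    z = np.zeros(n_actions)              # cumulative gradient in the dual space
    policy = np.full(n_actions, 1.0 / n_actions)
    for t in range(1, horizon + 1):
        g = payoff_gradient(policy)      # stochastic payoff feedback
        z += g                           # "lazy" accumulation, no per-step projection
        policy = softmax(eta * z / np.sqrt(t))   # entropic mirror map back to the simplex
    return policy

# Toy usage with a linear payoff u(x) = r . x, so the gradient is r plus noise.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r = np.array([0.2, 0.5, 0.3])
    noisy_grad = lambda x: r + 0.05 * rng.standard_normal(r.size)
    print(dual_averaging_policy(noisy_grad, n_actions=3))
```

With the decreasing effective step size eta/sqrt(t), the iterates concentrate on actions with the largest cumulative payoff estimates, which is the standard no-regret behavior that dual-averaging schemes rely on.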
Index Terms: Stochastic games, stationary Nash equilibrium, dual averaging, dual mirror descent, learning in games.
I. INTRODUCTION

Since the early work on the existence of a mixed-strategy Nash equilibrium in static noncooperative games [1], and its extension to the existence of stationary Nash equilibrium policies in dynamic stochastic games [2], substantial research has been devoted to developing scalable algorithms for computing Nash equilibrium (NE) points in static and dynamic environments. NE provides a stable solution concept for strategic multiagent decision-making systems, which is a desirable property in many applications such as socioeconomic systems [3], network security [4], and routing and scheduling [5], among many others [6], [7].

Unfortunately, computing NE is in general PPAD-hard [8] and is unlikely to admit a polynomial-time algorithm. To overcome this fundamental barrier, two main approaches have been adopted in the literature: i) searching for relaxed notions of stable solutions, such as correlated equilibrium [9], whose set includes the set of NE, and ii) searching for NE points in games with special structure, such as potential games [10] or concave games [11]. Thanks to recent advances in learning theory, it is known that some algorithms tailored to finding relaxed notions of equilibrium in case (i) can also be used to compute NE points of structured games in case (ii). For instance, the so-called no-regret algorithms always converge to the set of coarse correlated equilibria [7], and they can also be used to compute NE in the class of socially concave games [12]. However, such results have mainly been developed for static games, in which players repeatedly play the same game and gradually learn the underlying stationary environment. Unfortunately, extensions of such results to dynamic stochastic games [2], [13], in which the state of the game evolves as a result of players' past decisions and the realizations of a stochastic...