In this paper, we develop a novel approximate policy iteration reinforcement learning algorithm with unsupervised feature learning based on manifold regularization. The proposed algorithm can automatically learn data-driven smooth basis representations for value function approximation, which preserve the intrinsic geometry of the state space of Markov decision processes. Moreover, it provides a direct basis extension for new samples in both the policy learning and policy control processes. We evaluate the effectiveness and efficiency of the proposed algorithm on the inverted pendulum task. Simulation results show that the algorithm can learn smooth basis representations and excellent control policies.
I. INTRODUCTION

Reinforcement learning (RL) [1] is a computational approach for solving goal-directed sequential decision-making problems described by Markov decision processes (MDPs). Although dynamic programming [2] is a standard approach to solving MDPs, it suffers from the "curse of dimensionality" and requires knowledge of the system model. RL algorithms [3] are practical for MDPs with large discrete or continuous state spaces, and can also handle the learning scenario in which the model is unknown. A closely related topic is adaptive or approximate dynamic programming [4]-[9], which adopts a control-theoretic point of view and terminology.

Batch RL [10] is a subfield of dynamic programming-based RL that solves MDPs through a series of supervised learning problems. The goal of batch RL is to learn the best possible policy from training data collected from the unknown system. It can therefore make more efficient use of data and avoid stability issues. However, a major challenge is that it is infeasible to represent the solutions exactly for MDPs with large discrete or continuous state spaces. Approximate policy iteration (API) with function approximation methods [11] can provide a compact representation of the value function by storing only the parameters of the approximator. API [12] starts from an initial policy and iterates between policy evaluation and policy improvement to find an approximate solution to the fixed point of the Bellman optimality equation. Bradtke and Barto [13] proposed the popular Least-Squares Temporal Difference (LSTD) algorithm to perform policy evaluation.
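To make the API loop concrete, the following is a minimal sketch of LSTD-based policy evaluation alternated with greedy policy improvement (in the spirit of least-squares policy iteration). It is not the paper's algorithm: the feature map phi, the sample format, the ridge term, and the discrete action set are illustrative assumptions.

```python
# Sketch of approximate policy iteration with LSTD policy evaluation.
# Assumptions (not from the paper): samples are (s, a, r, s_next) tuples,
# phi(s, a) returns a length-k feature vector, and actions is a finite set.
import numpy as np

def lstd(samples, phi, policy, gamma, k, reg=1e-6):
    """LSTD policy evaluation: solve A w = b from transition samples."""
    A = reg * np.eye(k)          # small ridge term keeps A well conditioned
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))   # features under current policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)  # weights of the approximate Q-function

def approximate_policy_iteration(samples, phi, actions, gamma, k,
                                 n_iter=20, tol=1e-4):
    """Alternate LSTD evaluation with greedy policy improvement."""
    w = np.zeros(k)
    for _ in range(n_iter):
        # Greedy policy with respect to the current Q-function estimate.
        greedy = lambda s: max(actions, key=lambda a: phi(s, a) @ w)
        w_new = lstd(samples, phi, greedy, gamma, k)
        if np.linalg.norm(w_new - w) < tol:    # policy (weights) converged
            return w_new
        w = w_new
    return w
```

The outer loop mirrors the API scheme described above: each pass evaluates the current greedy policy from the fixed batch of samples and then improves it, stopping when the weight vector, and hence the induced policy, stops changing.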