In this paper, we develop a novel approximate policy iteration reinforcement learning algorithm with unsupervised feature learning based on manifold regularization. The proposed algorithm can automatically learn data-driven smooth basis representations for value function approximation, which preserve the intrinsic geometry of the state space of Markov decision processes. Moreover, it provides a direct basis extension for new samples in both the policy learning and policy control processes. We evaluate the effectiveness and efficiency of the proposed algorithm on the inverted pendulum task. Simulation results show that the algorithm can learn smooth basis representations and excellent control policies.
I. INTRODUCTION

Reinforcement learning (RL) [1] is a computational approach for solving goal-directed sequential decision-making problems described by Markov decision processes (MDPs). Although dynamic programming [2] is a standard approach to solving MDPs, it suffers from the "curse of dimensionality" and requires knowledge of the system model. RL algorithms [3] are practical for MDPs with large discrete or continuous state spaces, and can also handle the learning scenario in which the model is unknown. A closely related topic is adaptive or approximate dynamic programming [4]-[9], which adopts a control-theoretic point of view and terminology.

Batch RL [10] is a subfield of dynamic programming-based RL that solves MDPs through a series of supervised learning problems. The goal of batch RL is to learn the best possible policy from training data collected from the unknown system. It can therefore make more efficient use of data and avoid stability issues. However, a major challenge is that it is infeasible to represent the solutions exactly for MDPs with large discrete or continuous state spaces. Approximate policy iteration (API) with function approximation methods [11] can provide a compact representation of the value function by storing only the parameters of the approximator. API [12] starts from an initial policy and iterates between policy evaluation and policy improvement to find an approximate solution to the fixed point of the Bellman optimality equation. Bradtke and Barto [13] proposed the popular Least-Squares Temporal Difference (LSTD) algorithm to perform policy evaluation.
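To make the API loop concrete, the following is a minimal sketch of LSTD-based policy evaluation alternated with greedy policy improvement (in the spirit of least-squares policy iteration). It is not the paper's algorithm: the feature map phi, the sample format, the ridge term, and the discrete action set are illustrative assumptions.

```python
# Sketch of approximate policy iteration with LSTD policy evaluation.
# Assumptions (not from the paper): samples are (s, a, r, s_next) tuples,
# phi(s, a) returns a length-k feature vector, and actions is a finite set.
import numpy as np

def lstd(samples, phi, policy, gamma, k, reg=1e-6):
    """LSTD policy evaluation: solve A w = b from transition samples."""
    A = reg * np.eye(k)          # small ridge term keeps A well conditioned
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))   # features under current policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)  # weights of the approximate Q-function

def approximate_policy_iteration(samples, phi, actions, gamma, k,
                                 n_iter=20, tol=1e-4):
    """Alternate LSTD evaluation with greedy policy improvement."""
    w = np.zeros(k)
    for _ in range(n_iter):
        # Greedy policy with respect to the current Q-function estimate.
        greedy = lambda s: max(actions, key=lambda a: phi(s, a) @ w)
        w_new = lstd(samples, phi, greedy, gamma, k)
        if np.linalg.norm(w_new - w) < tol:    # policy (weights) converged
            return w_new
        w = w_new
    return w
```

The outer loop mirrors the API scheme described above: each pass evaluates the current greedy policy from the fixed batch of samples and then improves it, stopping when the weight vector, and hence the induced policy, stops changing.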