Selected Adaptive Critic (AC) methods are known to be capable of designing (approximately) optimal control policies for non-linear plants, in the sense of approximating Bellman Dynamic Programming. The present research focuses on an AC method known as Dual Heuristic Programming (DHP). There are many issues related to the pragmatics of successfully applying AC methods, but now that the operational aspects of the DHP method are becoming refined and better understood, it is instructive to carry out empirical research with the method to inform theoretical research being carried out in parallel. In particular, it is seen as useful to explore correspondences between the form of a Utility function and the resulting controllers designed by the DHP method. The task of designing a steering controller for a 2-axle, terrestrial, autonomous vehicle is the basis of the empirical work reported here (and in a companion paper). The new aspect in the present paper relates to using a pair of critics (distinct from the shadow critics described elsewhere by the authors) to "divide the labor" of training the controller. Improvements in convergence of the training process are realized in this way. The controllers designed by the DHP method are pleasingly robust, and demonstrate good performance on disturbances they were never trained on: 1) encountering a patch of ice during a steering maneuver, and 2) encountering a wind gust perpendicular to the direction of travel.
This paper for the special session on Adaptive Critic Design Methods at the SMC '97 Conference describes a modification to the (to date) usual procedures reported for training the Critic and Action neural networks in the Dual Heuristic Programming (DHP) method [7]-[12]. This modification entails updating both the Critic and the Action networks each computational cycle, rather than only one at a time. The distinction lies in the introduction of a (real) second copy of the Critic network whose weights are adjusted less often (once per "epoch", where an epoch is defined to comprise some number N > 1 of computational cycles); the "desired value" for training the other Critic is obtained from this Critic-Copy.

In a previous publication [4], the proposed modified training strategy was demonstrated on the well-known pole-cart controller problem. In that paper, the full 6-dimensional state vector was input to the Critic and Action NNs; however, the utility function involved only pole angle, not distance along the track (x). For the first set of results presented here, the 3 states associated with the x variable were eliminated from the inputs to the NNs, keeping the same utility function previously defined. This resulted in improved learning and controller performance. From this point, the method is applied to two additional problems of increasing complexity: in the first, an x-related term is added to the utility function for the pole-cart problem, and simultaneously, the x-related states are added back in to the NNs (i.e., the number of state variables used is increased from 3 to 6); the second relates to steering a vehicle with independent drive motors on each wheel. The problem contexts and experimental results are provided.

BACKGROUND

Dual Heuristic Programming (DHP) is a neural network approach to solving the Bellman equation [12].
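The epoch-based critic-copy schedule described above can be sketched in a few lines: the "live" Critic is updated every computational cycle against targets produced by a frozen copy, and the copy's weights are synchronized once per epoch of N cycles. The sketch below is illustrative only; the function name, the toy update rule passed in, and the use of a flat weight vector are assumptions, not details from the paper.

```python
import numpy as np

def train_with_critic_copy(n_cycles, epoch_len, update_fn, init_weights):
    """Two-critic schedule: the live critic trains every cycle against
    targets from a frozen critic-copy; the copy is refreshed once per
    epoch of `epoch_len` cycles."""
    live = np.array(init_weights, dtype=float)
    frozen = live.copy()                      # the Critic-Copy
    for t in range(n_cycles):
        live = update_fn(live, frozen)        # target comes from frozen copy
        if (t + 1) % epoch_len == 0:
            frozen = live.copy()              # sync once per epoch
    return live, frozen
```

With a toy update rule `live = frozen + 1`, two epochs of three cycles each advance the weights by exactly one unit per epoch, showing that the target stays fixed within an epoch.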
The idea is to maximize a specified (secondary) utility function:

J(t) = Σ_{k=0}^{∞} γ^k U(t+k)    (1)

The term γ is a discount factor (0 ≤ γ ≤ 1) and U(t) is the primary utility function, defined by the user for the specific application context. A useful identity:

J(t) = U(t) + γ J(t+1)    (2)

In this paper, γ is assumed to be 1, and the usual methods of discretizing continuous models of plants are used. For DHP, at least two neural nets are needed: one for the actionNN functioning as the controller, and one for the criticNN used to train the actionNN. A third NN could be trained to copy the plant if an analytical description (model) of the plant is not available. R(t) [dimension n] is the state of the plant at time t. The control signal u(t) [dimension a] is generated in response to the input R(t) by the actionNN. The signal u(t) is then asserted to the plant. As a result, the plant changes its state to R(t+1). The criticNN's role is to assist in designing a controller (actionNN) that is "good" relative to minimizing the specified cost function U(R(t), u(t)), which is designed to express the objective of the control application. In the DHP method, the criticNN estimates the gradient of J(t) with respect to R(t); the letter λ is used as a short-hand notation for this gradient.
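To make the critic's role concrete: differentiating identity (2) with respect to R(t) gives the DHP critic target λ°(t) = ∂U/∂R(t) + γ [∂R(t+1)/∂R(t)]ᵀ λ(t+1), where the total Jacobian must include the path through the controller. The sketch below computes this target for a hypothetical linear plant with a linear "action net"; the matrices A, B, K, Q and the function name are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Hypothetical linear plant R(t+1) = A R(t) + B u(t), linear policy
# u(t) = K R(t), and quadratic utility U = R^T Q R -- chosen only so the
# derivatives in the DHP target are exact and easy to check by hand.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
K = np.array([[-1.0, -0.5]])
Q = np.eye(2)

def dhp_lambda_target(R, lam_next, gamma=1.0):
    """Critic target: lambda°(t) = dU/dR + gamma * (dR(t+1)/dR(t))^T lam(t+1).
    The total Jacobian A + B K chains through the controller u = K R."""
    dU_dR = 2.0 * Q @ R              # direct derivative of U = R^T Q R
    J_total = A + B @ K              # plant Jacobian including action path
    return dU_dR + gamma * J_total.T @ lam_next
```

For example, with R = [1, 0] and λ(t+1) = 0 the target reduces to the direct term 2QR = [2, 0], which is what the criticNN would be trained toward at that state.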