Automated decision-making systems are increasingly being used in real-world applications. In most of these systems, the decision rules are derived by minimizing the training error on the available historical data. Therefore, if the data contain a bias related to a sensitive attribute such as gender, race, or religion, say, due to cultural or historical discriminatory practices against a certain demographic, the system could perpetuate this discrimination by incorporating the bias into its decision rule. We present an information-theoretic framework for designing fair predictors from data, which aims to prevent discrimination against a specified sensitive attribute in a supervised learning setting. We use equalized odds as the criterion for discrimination, which demands that the prediction be independent of the protected attribute conditioned on the actual label. To ensure fairness and generalization simultaneously, we compress the data to an auxiliary variable, which is used for the prediction task. This auxiliary variable is chosen such that it is decontaminated from the discriminatory attribute in the sense of equalized odds. The final predictor is obtained by applying a Bayesian decision rule to the auxiliary variable.
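The equalized odds criterion can be checked empirically from labeled samples. The sketch below is an illustration of the criterion itself, not the paper's construction; all names are hypothetical. It estimates the largest gap in positive-prediction rates across groups of the sensitive attribute, conditioned on the true label.

```python
import numpy as np

def equalized_odds_gap(y_true, y_pred, sensitive):
    """Estimate the equalized-odds violation of a binary predictor.

    Equalized odds requires P(Y_hat = 1 | A = a, Y = y) to be the same
    for every value a of the sensitive attribute, for each true label y.
    Returns the largest gap in these conditional rates across groups.
    """
    y_true, y_pred, sensitive = map(np.asarray, (y_true, y_pred, sensitive))
    gap = 0.0
    for y in (0, 1):  # condition on the actual label
        rates = []
        for a in np.unique(sensitive):
            mask = (y_true == y) & (sensitive == a)
            if mask.any():
                rates.append(y_pred[mask].mean())  # P(Y_hat = 1 | A = a, Y = y)
        if len(rates) > 1:
            gap = max(gap, max(rates) - min(rates))
    return gap

# Example: a predictor that simply copies the sensitive attribute
# maximally violates equalized odds.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=1000)
y = rng.integers(0, 2, size=1000)
print(equalized_odds_gap(y, a, a))  # close to 1: maximal violation
```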
Actor-critic-style two-time-scale algorithms are very popular in reinforcement learning and have seen great empirical success. However, their performance is not completely understood theoretically. In this paper, we characterize the global convergence of an online natural actor-critic algorithm in the tabular setting using a single trajectory. Our analysis applies to very general settings, as we only assume that the underlying Markov chain is ergodic under all policies (the so-called Recurrence assumption). We employ ε-greedy sampling in order to ensure enough exploration. For a fixed exploration parameter ε, we show that the natural actor-critic algorithm is O(1/T^{1/4} + ε) close to the global optimum after T iterations of the algorithm. By carefully diminishing the exploration parameter ε as the iterations proceed, we also show convergence to the global optimum at a rate of O(1/T^{1/6}).
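The ε-greedy sampling step can be sketched as follows; the mixing scheme and names are an illustrative assumption rather than code from the paper. With probability ε an action is drawn uniformly at random, which keeps every state-action pair explored along a single trajectory; otherwise the action is drawn from the current tabular policy.

```python
import numpy as np

def epsilon_greedy_sample(policy, state, epsilon, rng):
    """Sample an action from an epsilon-greedy version of a tabular policy.

    With probability epsilon the action is uniform over all actions
    (exploration); otherwise it is drawn from policy[state, :].
    """
    num_actions = policy.shape[1]
    if rng.random() < epsilon:
        return rng.integers(num_actions)              # exploratory action
    return rng.choice(num_actions, p=policy[state])   # on-policy action

# Usage on a toy 5-state, 3-action problem with a uniform initial policy.
rng = np.random.default_rng(0)
policy = np.full((5, 3), 1.0 / 3.0)
action = epsilon_greedy_sample(policy, state=2, epsilon=0.1, rng=rng)
```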
Markov Decision Processes are classically solved using the Value Iteration and Policy Iteration algorithms. Recent interest in Reinforcement Learning has motivated the study of methods inspired by optimization, such as gradient ascent. Among these, a popular algorithm is the Natural Policy Gradient, which is a mirror descent variant for MDPs. This algorithm forms the basis of several popular Reinforcement Learning algorithms, such as Natural Actor-Critic, TRPO, and PPO, and so is being studied with growing interest. It has been shown that Natural Policy Gradient with constant step size converges at a sublinear rate of O(1/k) to the global optimum. In this paper, we present improved finite-time convergence bounds and show that this algorithm has a geometric (also known as linear) asymptotic convergence rate. We further improve this convergence result by introducing a variant of Natural Policy Gradient with adaptive step sizes. Finally, we compare different variants of policy gradient methods experimentally.
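For the tabular softmax parameterization, the Natural Policy Gradient step has a well-known closed form as a per-state multiplicative-weights update. The sketch below shows one such step under the idealized assumption that the exact Q-values of the current policy are available; the names and step-size value are illustrative.

```python
import numpy as np

def npg_step(policy, q_values, step_size):
    """One Natural Policy Gradient / mirror-descent step for a tabular MDP.

    With a softmax parameterization, the NPG update has the closed form
        pi_{k+1}(a|s)  proportional to  pi_k(a|s) * exp(eta * Q^{pi_k}(s, a)),
    i.e. a multiplicative-weights update carried out independently per state.
    `policy` and `q_values` are |S| x |A| arrays.
    """
    logits = np.log(policy) + step_size * q_values
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

# One step on a toy problem with 2 states and 2 actions.
policy = np.full((2, 2), 0.5)
q = np.array([[1.0, 0.0], [0.0, 1.0]])
policy = npg_step(policy, q, step_size=0.5)
```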
In this paper, we develop a novel variant of the off-policy natural actor-critic algorithm with linear function approximation, and we establish a sample complexity of O(ε^{-3}), outperforming all previously known convergence bounds for such algorithms. To overcome the divergence due to the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs an n-step TD-learning algorithm with a properly chosen n. We present finite-sample convergence bounds on this critic under both constant and diminishing step sizes, which are of independent interest. Furthermore, we develop a variant of natural policy gradient under function approximation with an improved convergence rate of O(1/T) after T iterations. Combining the finite-sample error bounds of the actor and the critic, we obtain the O(ε^{-3}) sample complexity. We derive our sample complexity bounds solely under the assumption that the behavior policy sufficiently explores all states and actions, which is a much lighter assumption compared to the related literature.

We establish an O(ε^{-3}) sample complexity, which is the best known convergence bound in the literature for AC algorithms with function approximation.

Novelty in the Critic. Off-policy TD with function approximation is famously known to diverge due to the deadly triad [65]. To overcome this difficulty, we employ n-step TD-learning and show that a proper choice of n naturally achieves convergence, and we present finite-sample bounds under both constant and diminishing step sizes. To the best of our knowledge, we are the first to design a single-time-scale off-policy TD algorithm with function approximation with provable finite-sample bounds.

Novelty in the Actor. NAC under function approximation was developed in [1] by projecting the Q-values (gradients) to the lower-dimensional space, which involves the discounted state visitation distribution, a quantity that is hard to estimate. We develop a new NAC algorithm for the function approximation setting that is instead based on the solution of a projected Bellman equation [73], which our critic is designed to solve.

Exploration through Off-Policy Sampling. We establish the convergence bounds under a minimal set of assumptions, viz., ergodicity under the behavior policy, which ensures sufficient exploration and thus resolves the challenges faced in on-policy sampling. As a result, learning can be done using a single trajectory of samples generated by the behavior policy, and we do not require the constant resets of the system introduced in on-policy AC algorithms [1, 75] to ensure exploration. A similar observation about employing off-policy sampling to ensure exploration has been made in the tabular setting in [34].

1.2. Related Literature. The two main approaches for learning an optimal policy in an RL problem are value-space methods, such as Q-learning, and policy-space methods, such as AC. The Q-learning algorithm proposed in [77] is perhaps the most well-known value-space method. The asymptotic convergence of Q-learning was established in [1...
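The paper's exact critic update is not reproduced here. As a rough illustration of the n-step TD building block with linear function approximation, the following sketch uses a generic importance-sampling-corrected n-step return; the function name, the form of the correction, and the step-size handling are assumptions for illustration only.

```python
import numpy as np

def n_step_td_update(w, phi, rewards, rhos, gamma, alpha):
    """One n-step TD update with linear function approximation.

    phi     : feature vectors phi(s_t), ..., phi(s_{t+n})
    rewards : rewards r_t, ..., r_{t+n-1} along the behavior trajectory
    rhos    : importance-sampling ratios pi(a|s)/mu(a|s) for the same steps
              (set to 1.0 everywhere in the on-policy case)
    The target is the importance-corrected n-step return bootstrapped with
    the current weights, and the update is a semi-gradient step on phi(s_t).
    """
    n = len(rewards)
    G, corr = 0.0, 1.0
    for k in range(n):
        corr *= rhos[k]                       # product of ratios up to step k
        G += (gamma ** k) * corr * rewards[k]
    G += (gamma ** n) * corr * (w @ phi[n])   # bootstrap from the last state
    td_error = G - w @ phi[0]
    return w + alpha * td_error * phi[0]      # semi-gradient step

# Toy usage with random 4-dimensional features and n = 3.
rng = np.random.default_rng(0)
w = np.zeros(4)
phi = rng.normal(size=(4, 4))   # phi(s_t), ..., phi(s_{t+3})
w = n_step_td_update(w, phi, rewards=[1.0, 0.0, 0.5],
                     rhos=[1.0, 1.0, 1.0], gamma=0.95, alpha=0.1)
```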