This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.
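To make the gap-increasing idea concrete, the sketch below iterates a standard Bellman backup and a consistent-style backup on a small random MDP and compares the resulting action gaps. The toy MDP, variable names, and constants are illustrative assumptions, and the correction term uses the common form of the consistent operator (subtracting the self-transition-weighted action gap); this is a minimal sketch, not the paper's experimental setup.

```python
import numpy as np

# Toy tabular MDP (hypothetical, for illustration only):
# n_states x n_actions reward matrix R and transition tensor P[s, a, s'].
n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)
R = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def bellman(Q):
    # Standard Bellman optimality backup: T Q(x,a) = r(x,a) + gamma * E[max_a' Q(x',a')].
    return R + gamma * P @ Q.max(axis=1)

def consistent_bellman(Q):
    # Consistent-style backup: subtract gamma * P(x | x, a) * (max_a' Q(x,a') - Q(x,a)),
    # which leaves the optimal policy unchanged but widens the action gap.
    gap = Q.max(axis=1, keepdims=True) - Q                        # action gap at each state
    self_loop = P[np.arange(n_states), :, np.arange(n_states)]    # P(x | x, a)
    return bellman(Q) - gamma * self_loop * gap

Q = np.zeros((n_states, n_actions))
Qc = np.zeros((n_states, n_actions))
for _ in range(500):
    Q, Qc = bellman(Q), consistent_bellman(Qc)

print("action gaps (Bellman):   ", Q.max(axis=1) - Q.min(axis=1))
print("action gaps (consistent):", Qc.max(axis=1) - Qc.min(axis=1))
```

On such toy problems the greedy policies of the two fixed points coincide, while the consistent-style backup leaves a larger margin between the best and second-best action values, which is the property argued to buffer against approximation error.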
Intelligent machines using machine learning algorithms are ubiquitous, ranging from simple data analysis and pattern recognition tools to complex systems that achieve superhuman performance on various tasks. Ensuring that they do not exhibit undesirable behavior—that they do not, for example, cause harm to humans—is therefore a pressing problem. We propose a general and flexible framework for designing machine learning algorithms. This framework simplifies the problem of specifying and regulating undesirable behavior. To show the viability of this framework, we used it to create machine learning algorithms that precluded the dangerous behavior caused by standard machine learning algorithms in our experiments. Our framework for designing machine learning algorithms simplifies the safe and responsible application of machine learning.
Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, have heretofore not provided confidence guarantees about the accuracy of their estimates. In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.
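The abstract does not specify the estimator, but the general idea can be sketched as ordinary importance sampling over clipped per-trajectory returns combined with a one-sided Hoeffding bound. The function name, clipping threshold, toy policies, and the Hoeffding-based bound below are assumptions for illustration, not the paper's method.

```python
import numpy as np

def is_lower_bound(trajectories, pi_e, pi_b, clip=10.0, delta=0.05):
    """Sketch: lower confidence bound on the expected return of an evaluation
    policy pi_e from trajectories collected under a behavior policy pi_b.

    Each trajectory is a list of (state, action, reward) tuples; pi_e(a, s) and
    pi_b(a, s) return action probabilities. Returns are assumed nonnegative, so
    the clipped importance-weighted returns lie in [0, clip].
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for s, a, r in traj:
            weight *= pi_e(a, s) / pi_b(a, s)   # per-step importance ratio
            ret += r                            # undiscounted return, for simplicity
        # Clipping can only lower the estimate, so the bound below remains
        # a valid (if conservative) lower bound on the true expected return.
        estimates.append(min(weight * ret, clip))
    estimates = np.asarray(estimates)
    n = len(estimates)
    # One-sided Hoeffding bound for variables in [0, clip]: with probability
    # at least 1 - delta, the true mean exceeds the value returned here.
    return estimates.mean() - clip * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

# Toy usage with a uniform behavior policy and a slightly greedier evaluation policy:
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pi_b = lambda a, s: 0.5                      # behavior: uniform over 2 actions
    pi_e = lambda a, s: 0.7 if a == 1 else 0.3   # evaluation: prefers action 1
    trajs = [[(0, rng.integers(2), rng.random() * 0.1) for _ in range(5)]
             for _ in range(2000)]
    print("lower bound on expected return:", is_lower_bound(trajs, pi_e, pi_b))
```

With few trajectories or a large clipping threshold the bound can be loose (even negative); tighter concentration inequalities trade conservativeness for stronger assumptions.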
The main objective in the ad recommendation problem is to find a strategy that, for each visitor of the website, selects the ad that has the highest probability of being clicked. This strategy could be computed using supervised learning or contextual bandit algorithms, which treat two visits of the same user as two separate independent visitors, and thus optimize greedily for a single step into the future. Another approach would be to use reinforcement learning (RL) methods, which differentiate between two visits of the same user and two different visitors, and thus optimize for multiple steps into the future, i.e., the life-time value (LTV) of a customer. While greedy methods have been well studied, the LTV approach is still in its infancy, mainly due to two fundamental challenges: how to compute a good LTV strategy and how to evaluate a solution using historical data to ensure its "safety" before deployment. In this paper, we tackle both of these challenges by proposing to use a family of off-policy evaluation techniques with statistical guarantees about the performance of a new strategy. We apply these methods to a real ad recommendation problem, both for evaluating the final performance and for optimizing the parameters of the RL algorithm. Our results show that our LTV optimization algorithm equipped with these off-policy evaluation techniques outperforms the greedy approaches. They also give fundamental insights into the difference between the click-through rate (CTR) and LTV metrics for performance evaluation in the ad recommendation problem.
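To make the CTR/LTV distinction concrete, here is a toy calculation with made-up numbers (the ad names, click and retention probabilities, and discount factor are illustrative assumptions, not data from the paper): an ad that maximizes the immediate click probability can still lose on life-time value if it drives users away.

```python
import numpy as np

# Toy illustration (hypothetical numbers): an "aggressive" ad has a higher
# immediate click probability but makes the user less likely to return,
# while a "gentle" ad clicks less often but retains the user.
ads = {
    "aggressive": {"click_prob": 0.10, "retain_prob": 0.70},
    "gentle":     {"click_prob": 0.06, "retain_prob": 0.95},
}

def ctr(ad):
    # Greedy / contextual-bandit view: optimize the single-visit click probability.
    return ads[ad]["click_prob"]

def ltv(ad, gamma=0.98):
    # RL / life-time-value view: expected discounted clicks over repeated visits.
    # Each visit yields click_prob expected clicks; the user returns with
    # probability retain_prob, giving a geometric series p / (1 - gamma * q).
    p, q = ads[ad]["click_prob"], ads[ad]["retain_prob"]
    return p / (1.0 - gamma * q)

for ad in ads:
    print(f"{ad:>10}: CTR = {ctr(ad):.3f}, LTV = {ltv(ad):.3f}")
# The greedy CTR criterion prefers the aggressive ad, while the LTV criterion
# prefers the gentle one, which is the kind of gap the abstract refers to.
```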