In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of n trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with large strategy sets are still a topic of very active investigation, motivated by practical applications such as online auctions and web advertisement. The goal of such research is to identify broad and natural classes of strategy sets and payoff functions which enable the design of efficient solutions.

In this work we study a very general setting for the multi-armed bandit problem in which the strategies form a metric space, and the payoff function satisfies a Lipschitz condition with respect to the metric. We refer to this problem as the Lipschitz MAB problem. We present a complete solution for the multi-armed bandit problem in this setting. That is, for every metric space (L, X) we define an isometry invariant MaxMinCOV(X) which bounds from below the performance of Lipschitz MAB algorithms for X, and we present an algorithm that comes arbitrarily close to meeting this bound. Furthermore, our technique gives even better results for benign payoff functions.
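As a concrete illustration of the setting (a minimal sketch, not the paper's algorithm), the Python snippet below runs the classical UCB1 index over a fixed eps-net of the strategy space [0, 1] under the absolute-value metric; the Lipschitz condition guarantees the best arm in the net is within eps of the best strategy overall. The payoff function and all parameter values are hypothetical.

import math
import random

def ucb_on_eps_net(mu, n_rounds, eps, rng=random.Random(0)):
    # UCB1 over the eps-net {0, eps, 2*eps, ..., 1} of the strategy space [0, 1].
    # mu is the mean-payoff function, assumed 1-Lipschitz with values in [0, 1];
    # realized rewards are Bernoulli(mu(x)). Returns the total realized payoff.
    k = int(round(1 / eps)) + 1
    arms = [min(i * eps, 1.0) for i in range(k)]
    counts, sums, total = [0] * k, [0.0] * k, 0.0
    for t in range(1, n_rounds + 1):
        if t <= k:                  # initialization: pull each arm once
            i = t - 1
        else:                       # pull the arm with the largest UCB1 index
            i = max(range(k), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
        reward = 1.0 if rng.random() < mu(arms[i]) else 0.0
        counts[i] += 1
        sums[i] += reward
        total += reward
    return total

# Hypothetical 1-Lipschitz payoff, peaked at strategy 0.3.
payoff = lambda x: 0.8 - min(abs(x - 0.3), 0.8)
print(ucb_on_eps_net(payoff, n_rounds=10000, eps=0.05))

A fixed net like this already gives sublinear regret; the paper's point is that the optimal regret exponent for a given metric space is governed by the invariant MaxMinCOV(X), which adaptive algorithms can exploit beyond any fixed discretization.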
Multi-armed bandit problems are the predominant theoretical model of exploration-exploitation tradeoffs in learning, and they have countless applications ranging from medical trials, to communication networks, to Web search and advertising. In many of these application domains the learner may be constrained by one or more supply (or budget) limits, in addition to the customary limitation on the time horizon. The literature lacks a general model encompassing these sorts of problems. We introduce such a model, called bandits with knapsacks, which combines bandit learning with aspects of stochastic integer programming. In particular, a bandit algorithm needs to solve a stochastic version of the well-known knapsack problem, which is concerned with packing items into a limited-size knapsack. A distinctive feature of our problem, in comparison to the existing regret-minimization literature, is that the optimal policy for a given latent distribution may significantly outperform the policy that plays the optimal fixed arm. Consequently, achieving sublinear regret in the bandits-with-knapsacks problem is significantly more challenging than in conventional bandit problems.

We present two algorithms whose reward is close to the information-theoretic optimum: one is based on a novel "balanced exploration" paradigm, while the other is a primal-dual algorithm that uses multiplicative updates. Further, we prove that the regret achieved by both algorithms is optimal up to polylogarithmic factors. We illustrate the generality of the problem by presenting applications in a number of different domains, including electronic commerce, routing, and scheduling. As one example of a concrete application, we consider the problem of dynamic posted pricing with limited supply and obtain the first algorithm whose regret, with respect to the optimal dynamic policy, is sublinear in the supply.

* An extended abstract of this paper [Badanidiyuru et al., 2013] was published in IEEE FOCS 2013. This paper has undergone several rounds of revision since the original "full version" was posted on arxiv.org in May 2013. The presentation has been thoroughly revised throughout, from low-level edits to technical intuition and details to a high-level restructuring of the paper. Some of the results have been improved: a stronger regret bound in one of the main results (Theorem 4.1), and a more general example of preadjusted discretization for dynamic pricing (Theorem 7.7). The introduction discusses a significant amount of follow-up work and is up to date regarding open questions.
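A toy instance (hypothetical numbers, not from the paper) makes the distinctive feature above concrete: under a knapsack constraint, a policy that switches arms once the budget is spent strictly beats every fixed-arm policy, so the usual best-fixed-arm benchmark is inadequate.

T, B = 1000, 400  # time horizon and resource budget (hypothetical)

# (reward, resource consumed) per pull; deterministic for clarity
ARMS = {"high": (1.0, 1.0),   # high reward, but burns the budget
        "free": (0.3, 0.0)}   # modest reward, costs nothing

def run(policy):
    # Play policy(t, budget_left) until time runs out or the budget is exhausted.
    budget, total = float(B), 0.0
    for t in range(T):
        reward, cost = ARMS[policy(t, budget)]
        if cost > budget:
            break
        budget -= cost
        total += reward
    return total

best_fixed = max(run(lambda t, b: "high"), run(lambda t, b: "free"))
switching = run(lambda t, b: "high" if b >= 1.0 else "free")
print(f"best fixed arm: {best_fixed}, switching policy: {switching}")
# -> best fixed arm: 400.0, switching policy: 580.0

The fixed arm "high" earns B = 400 before the budget stops it, and "free" earns 0.3T = 300; the switching policy earns 580. An algorithm must therefore compete with adaptive (here, budget-aware mixed) policies, which is what makes sublinear regret harder to achieve in this model.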
It is widely believed that computing payments needed to induce truthful bidding is somehow harder than simply computing the allocation. We show that the opposite is true: creating a randomized truthful mechanism is essentially as easy as a single call to a monotone allocation rule. Our main result is a general procedure that takes a monotone allocation rule for a single-parameter domain and transforms it (via a black-box reduction) into a randomized mechanism that is truthful in expectation and individually rational for every realization. The mechanism implements the same outcome as the original allocation rule with probability arbitrarily close to 1, and requires evaluating that allocation rule only once. We also provide an extension of this result to multi-parameter domains and cycle-monotone allocation rules, under mild star-convexity and nonnegativity hypotheses on the type space and allocation rule, respectively.

Because our reduction is simple, versatile, and general, it has many applications to mechanism design problems in which re-evaluating the allocation rule is either burdensome or informationally impossible. Applying our result to the multi-armed bandit problem, we obtain truthful randomized mechanisms whose regret matches the information-theoretic lower bound up to logarithmic factors, even though prior work showed this is impossible for truthful deterministic mechanisms. We also present applications to offline mechanism design, showing that randomization can circumvent a communication-complexity lower bound for deterministic payment computation, and that it can also be used to create truthful shortest-path auctions that approximate the welfare of the VCG allocation arbitrarily well, while having the same running-time complexity as Dijkstra's algorithm.

This is a merged and revised version of the conference papers [Babaioff et al. 2010, 2013b] that appeared in the ACM Conference on Electronic Commerce (ACM EC) in 2010 and 2013, respectively. This article contains all results from Babaioff et al. [2010] and the main result from Babaioff et al. [2013b] (in Section 8). This version is updated to reflect the current status of the follow-up work and open questions.
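For intuition, here is a sketch of the single-parameter payment identity such reductions target (Myerson's formula, not the paper's self-resampling construction): the unique truthful payment for a monotone allocation rule A at bid b is b*A(b) - integral_0^b A(z) dz, and the integral equals b * E[A(z)] for z uniform on [0, b], so it admits a one-sample unbiased estimate. The allocation rule and all numbers below are hypothetical; the paper's contribution goes further, folding both terms into a single randomized call to A.

import random

def myerson_payment_estimate(A, bid, rng):
    # One-sample unbiased estimate of Myerson's payment
    #   p(bid) = bid * A(bid) - integral_0^bid A(z) dz,
    # using integral_0^bid A(z) dz = bid * E[A(z)] for z ~ Uniform(0, bid).
    z = rng.uniform(0, bid)
    return bid * A(bid) - bid * A(z)

def myerson_payment_exact(A, bid, grid=10000):
    # Reference value via numeric integration (many calls to A).
    integral = sum(A(i * bid / grid) for i in range(grid)) * (bid / grid)
    return bid * A(bid) - integral

# Hypothetical monotone allocation rule: win fully once the bid clears 0.5.
A = lambda b: 1.0 if b >= 0.5 else 0.0
rng = random.Random(0)
est = sum(myerson_payment_estimate(A, 0.8, rng) for _ in range(10000)) / 10000
print(est, myerson_payment_exact(A, 0.8))  # both approach 0.5, the threshold price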