Abstract—In this paper, we consider decentralized sequential decision making in distributed online recommender systems, where items are recommended to users based on their search queries as well as their specific backgrounds, including purchase history, gender, and age, all of which comprise the user's context information. In contrast to centralized recommender systems, in which a single centralized seller has access to the complete inventory of items as well as the complete record of sales and user information, in decentralized recommender systems each seller/learner only has access to the inventory of items and the user information for its own products, not those of other sellers, but can earn a commission by selling another seller's item. The sellers must therefore learn, in a distributed fashion, which items to recommend to each incoming user (from their own inventory or from another seller's) in order to maximize the revenue from their own sales and commissions. We formulate this problem as a cooperative contextual bandit problem and analytically bound the performance of the sellers relative to the best recommendation strategy given the complete realization of user arrivals, the inventory of items, and the context-dependent purchase probabilities of each item. We verify our results via numerical examples on a distributed data set adapted from Amazon data, and we evaluate how a seller's performance depends on its inventory of items, the number of connections it has with other sellers, and the commissions it earns by selling other sellers' items to its users.

Index Terms—Multi-agent online learning, collaborative learning, distributed recommender systems, contextual bandits, regret.
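The abstract does not spell out the sellers' algorithm, so the following is only a minimal sketch of the setting it describes: a single seller choosing, per discretized user context, between recommending its own items (full revenue) and a partner's items (commission), using a simple epsilon-greedy rule as a stand-in for the paper's actual cooperative contextual bandit algorithm. All names (`Seller`, `recommend`, `update`), the epsilon-greedy rule, and the commission value are hypothetical.

```python
import random
from collections import defaultdict

class Seller:
    """Hypothetical sketch of one seller in the decentralized setting:
    arms are (item, revenue_share) pairs; own items pay full revenue,
    partner items pay only a commission."""

    def __init__(self, own_items, partner_items, commission=0.5, epsilon=0.1):
        self.arms = [(i, 1.0) for i in own_items] + \
                    [(i, commission) for i in partner_items]
        self.epsilon = epsilon
        self.counts = defaultdict(int)    # plays per (context, item)
        self.means = defaultdict(float)   # estimated purchase prob. per (context, item)

    def recommend(self, context):
        # Explore with probability epsilon; otherwise exploit the arm with
        # the highest estimated expected revenue (purchase prob. x share).
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.means[(context, a[0])] * a[1])

    def update(self, context, item, purchased):
        # Incremental mean update from the observed sale outcome.
        key = (context, item)
        self.counts[key] += 1
        self.means[key] += (purchased - self.means[key]) / self.counts[key]

# Usage: contexts stand in for discretized user profiles (e.g. age group);
# the 0.3 purchase probability is an arbitrary placeholder.
seller = Seller(own_items=["book", "dvd"], partner_items=["toy"])
for t in range(1000):
    ctx = random.choice(["young", "adult"])
    item, share = seller.recommend(ctx)
    purchased = random.random() < 0.3
    seller.update(ctx, item, purchased)
```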
Abstract—In many types of multi-agent systems, distributed agents cooperate with each other to take actions with the goal of maximizing an overall system reward. However, in many of these systems, agents only receive (perhaps noisy) global feedback about the realized overall reward rather than individualized feedback about the relative merit of their own actions with respect to the overall reward. If the contribution of an agent's actions to the overall reward is unknown a priori, it is crucial for the agents to utilize a distributed algorithm that still allows them to learn their best actions. In this paper, we rigorously formalize this problem and develop online learning algorithms which enable the agents to cooperatively learn how to maximize the overall reward in these global feedback scenarios without exchanging any information among themselves. We prove that, if the agents observe the global feedback without errors, the distributed nature of the considered multi-agent system results in no performance loss compared with the case where agents can exchange information. When the agents' individual observations are erroneous, existing centralized algorithms, including popular ones like UCB1, break down. To address this challenge, we propose a novel class of distributed algorithms that are robust to individual observation errors and whose performance can be analytically bounded. We prove that our algorithms' learning regrets, i.e., the losses incurred by the algorithms due to uncertainty, increase logarithmically in time, and thus the time-average reward converges to the optimal average reward. Moreover, we illustrate how the regret depends on the size of the action space, and we show that this relationship is influenced by the informativeness of the reward structure with regard to each agent's individual action. We prove that when the overall reward is fully informative, regret is linear in the total number of actions of all the agents; when the reward function is not informative, regret is linear in the number of joint actions. Our analytic and numerical results show that the proposed learning algorithms significantly outperform existing online learning solutions in terms of regret and learning speed. We illustrate how our theoretical framework can be used in practice by applying it to online Big Data mining using distributed classifiers. However, our framework can be applied to many other applications, including online distributed decision making in cooperative multi-agent systems (e.g. packet routing or network coding in multi-hop networks), cross-layer optimization (e.g. parameter selection in different layers), multi-core processors, etc.

Index Terms—Multi-agent learning, online learning, multi-armed bandits, Big Data mining, distributed cooperative learning, reward informativeness.
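The abstract names UCB1 as the centralized baseline that breaks down under erroneous individual observations. Below is a minimal sketch of standard UCB1 (the index mean + sqrt(2 ln t / n)) to make that failure mode concrete: the index trusts every observed reward, so systematic observation errors bias the empirical means and the logarithmic-regret guarantee no longer applies. The bit-flip noise model in the usage lines is a hypothetical stand-in, not the paper's error model.

```python
import math
import random

def ucb1(reward_fn, n_arms, horizon):
    """Standard UCB1: play each arm once, then pick the arm maximizing
    empirical mean + sqrt(2 ln t / n_i)."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                       # initialization round
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = reward_fn(arm)                    # possibly erroneous observation
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return means

# Usage: Bernoulli arms observed through a hypothetical 10% bit-flip error;
# the returned empirical means are biased toward 0.5 by the noise.
true_p = [0.2, 0.5, 0.8]
noisy = lambda a: (random.random() < true_p[a]) ^ (random.random() < 0.1)
print(ucb1(lambda a: float(noisy(a)), n_arms=3, horizon=10_000))
```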