A restless bandit with no observable states for recommendation systems and communication link scheduling

Meshram, Rahul; Manjunath, D.; Gopalan, Aditya

doi:10.1109/cdc.2015.7403456

Cited by 13 publications

(8 citation statements)

References 15 publications


“…A DM perceives one of a finite number of messages. Assume that $O = \{1, 2, 3, \dots, K\}$ represents the set of messages. The message $k \in O$ is observed with known probability from state $j$ under action $a$ for system $i$, and this is denoted by $q^{a}_{i,jk} = \Pr(k \mid s_{t,i} = j, a_{t,i} = a)$.…”

Section: Model Description (mentioning)

confidence: 99%
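
The observation probability quoted above is the ingredient a decision maker needs to track a belief over the hidden state. The following is a minimal sketch of one Bayesian belief step for a single arm, not the citing paper's algorithm. It assumes the message is emitted by the current state before the transition, uses 0-based message indices (the quote uses 1 to K), and all names (`belief_update`, `P`, `Q`) and numbers are hypothetical.

```python
import numpy as np

def belief_update(belief, P, Q, k):
    """One Bayesian belief step for a single arm under a fixed action.

    belief: length-J probability vector over the hidden state at time t
    P:      J x J transition matrix for the chosen action (rows sum to 1)
    Q:      J x K observation matrix, Q[j, k] = Pr(message k | state j, action)
    k:      0-based index of the message actually observed
    """
    belief = np.asarray(belief, dtype=float)
    # Condition on the observed message (Bayes' rule on the current state).
    unnormalized = belief * Q[:, k]
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observed message has zero probability under this belief")
    posterior = unnormalized / total
    # Propagate the posterior one step through the Markov chain.
    return posterior @ P

# Illustrative two-state, two-message arm (made-up numbers).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
Q = np.array([[0.8, 0.2],
              [0.3, 0.7]])
new_belief = belief_update([0.5, 0.5], P, Q, k=0)
```

If the model instead emits the message after the transition, the two steps would be applied in the opposite order.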

“…Restless multi-armed bandits with partially observable states have recently found applications in online recommendation systems [1], opportunistic communication systems [2]-[4], machine maintenance [5], and age of information [6]. Restless multi-armed bandits (RMABs) are a class of sequential decision problems with multiple independent Markov processes, which are coupled via the number of independent processes that are activated simultaneously [7].…”

Section: Introduction (mentioning)

confidence: 99%

“…Most RMAB problems with partially observable states are studied for the two-state model with various assumptions on transition probabilities, reward structure and observation probabilities [1]-[4], [10]-[12]. Much less attention is given to models with more than two states.…”

Section: Introduction (mentioning)

confidence: 99%

Indexability and Rollout Policy for Multi-State Partially Observable Restless Bandits

Meshram, Kaza. 2021. Preprint. Self cite.

Restless multi-armed bandits with partially observable states have applications in communication systems, age of information and recommendation systems. In this paper, we study multi-state partially observable restless bandit models. We consider three models that differ in the information observable to the decision maker: 1) no information is observable from actions on a bandit; 2) perfect information is observable for only one action on a bandit, and there is a fixed restart state, i.e., transitions occur from all other states to that state; 3) perfect state information is available to the decision maker for both actions on a bandit, and there are two restart states for the two actions. We develop structural properties and show a threshold-type policy and indexability for models 2 and 3. We present a Monte Carlo (MC) rollout policy and use it for Whittle index computation in model 2. We obtain a concentration bound on the value function in terms of the horizon length and the number of trajectories for the MC rollout policy. We derive an explicit index formula for model 3. Finally, we describe the Monte Carlo rollout policy for model 1, where indexability is difficult to show. We present numerical examples using the myopic policy, the Monte Carlo rollout policy and the Whittle index policy, and observe that the Monte Carlo rollout policy is competitive with the myopic policy.
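
The abstract describes the rollout estimate in terms of a horizon length and a number of trajectories. A generic Monte Carlo rollout can be sketched as below; this is an illustration of the general technique, not the paper's procedure. The simulator interface `step(state, action, rng)`, the base policy, the discount factor `beta`, and the toy arm are all assumptions introduced here, and the toy arm is deterministic only so that the result is easy to check by hand.

```python
import random

def rollout_value(state, first_action, step, base_policy,
                  horizon, num_traj, beta=0.9, seed=0):
    """Monte Carlo estimate of the discounted return from `state` when
    `first_action` is taken first and `base_policy` is followed afterwards."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_traj):
        s, a = state, first_action
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            s, reward = step(s, a, rng)   # simulate one transition
            ret += discount * reward
            discount *= beta
            a = base_policy(s)            # base policy picks later actions
        total += ret
    return total / num_traj

def rollout_action(state, actions, step, base_policy,
                   horizon, num_traj, beta=0.9):
    """Pick the first action with the largest rollout estimate."""
    return max(actions, key=lambda a: rollout_value(
        state, a, step, base_policy, horizon, num_traj, beta))

# Hypothetical toy arm: playing (a=1) pays the current state's value and
# restarts the arm in state 0; not playing (a=0) pays 0 and moves to state 1.
def toy_step(s, a, rng):
    if a == 1:
        return 0, float(s)
    return 1, 0.0

def toy_base_policy(s):
    return 1 if s == 1 else 0

v_play = rollout_value(1, 1, toy_step, toy_base_policy, horizon=3, num_traj=5, beta=0.5)
v_wait = rollout_value(1, 0, toy_step, toy_base_policy, horizon=3, num_traj=5, beta=0.5)
best = rollout_action(1, [0, 1], toy_step, toy_base_policy, horizon=3, num_traj=5, beta=0.5)
```

With a stochastic simulator, the estimate sharpens as `num_traj` grows, while `horizon` trades truncation error against simulation cost.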

“…RMABs have been used for various applications across domains. Some specific applications include recommendation systems [29], [30], sensor scheduling and target detection [31], multi-UAV routing for observing targets [32], and stochastic network optimization [24]. Most models assume instantaneous feedback and their main interest is to study the Whittle-index or myopic policy.…”

Section: Literature Overview and Contributions (mentioning)

confidence: 99%

Sequential Decision Making With Limited Observation Capability: Application to Wireless Networks

Kaza, Meshram, Mehta, et al. 2019. IEEE Trans. Cogn. Commun. Netw. Self cite.

This work studies a generalized class of restless multi-armed bandits with hidden states and allows cumulative feedback, as opposed to the conventional instantaneous feedback. We call them lazy restless bandits (LRB) as the events of decision-making are sparser than events of state transition. Hence, feedback after each decision event is the cumulative effect of the following state transition events. The states of arms are hidden from the decision-maker and rewards for actions are state dependent. The decision-maker needs to choose one arm in each decision interval, such that the long-term cumulative reward is maximized. As the states are hidden, the decision-maker maintains and updates its belief about them. It is shown that LRBs admit an optimal policy which has a threshold structure in belief space. The Whittle-index policy for solving the LRB problem is analyzed; indexability of LRBs is shown. Further, closed-form index expressions are provided for two sets of special cases; for more general cases, an algorithm for index computation is provided. An extensive simulation study is presented; Whittle-index, modified Whittle-index and myopic policies are compared. Lagrangian relaxation of the problem provides an upper bound on the optimal value function; it is used to assess the degree of sub-optimality of various policies.

“…Further, the author proposed a heuristic index-based policy, referred to as the Whittle index policy. In [19,22], we have considered a general system of a restless multi-armed bandit with unobservable states and action-dependent transitions. In [22] we show that such a system is approximately Whittle-indexable.…”

Section: Related Literature (mentioning)

confidence: 99%

A Hidden Markov Restless Multi-armed Bandit Model for Playout Recommendation Systems

Meshram, Gopalan, Manjunath. 2017. Lecture Notes in Computer Science. Self cite.

Abstract. We consider a restless multi-armed bandit (RMAB) in which there are two types of arms, say A and B. Each arm can be in one of two states, say 0 or 1. Playing a type A arm brings it to state 0 with probability one, and not playing it induces state transitions with arm-dependent probabilities. Playing a type B arm, in contrast, brings it to state 1 with probability one, and not playing it lets its state evolve according to the transition probabilities of the arm. Further, play of an arm generates a unit reward with a probability that depends on the state of the arm. The belief about the state of the arm can be calculated using a Bayesian update after every play. This RMAB has been designed for use in recommendation systems where the user's preferences depend on the history of recommendations. This RMAB can also be used in applications like creation of playlists or placement of advertisements. In this paper we formulate the long-term reward maximization problem as an infinite-horizon discounted reward problem and as an average reward problem. We analyse the RMAB by first studying the discounted reward scenario. We show that it is Whittle-indexable and then obtain a closed-form expression for the Whittle index for each arm, calculated from the belief about its state and the parameters that describe the arm. We next analyse the average reward problem using the vanishing discount approach and derive a closed-form expression for the Whittle index. For an RMAB to be useful in practice, we need to be able to learn the parameters of the arms. We present an algorithm derived from the Thompson sampling scheme that learns the parameters of the arms, and we also illustrate its performance numerically.
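
The belief dynamics described in this abstract can be illustrated with a short sketch. It is a reading of the abstract rather than the paper's code: `pi` denotes the belief that the arm is in state 1, and the parameter names `p01`, `p11` (probabilities of moving to state 1 from states 0 and 1 when the arm is not played) and `rho0`, `rho1` (reward probabilities in each state) are hypothetical.

```python
def next_belief(pi, played, arm_type, p01, p11):
    """Belief that the arm is in state 1 at the next step.

    Playing a type A arm resets it to state 0; playing a type B arm
    resets it to state 1. An unplayed arm evolves as a two-state chain.
    """
    if played:
        return 0.0 if arm_type == "A" else 1.0
    # Total probability over the current (hidden) state.
    return pi * p11 + (1.0 - pi) * p01

def expected_reward(pi, rho0, rho1):
    """Probability of a unit reward when an arm with belief pi is played."""
    return (1.0 - pi) * rho0 + pi * rho1

# Illustrative values only.
b_idle = next_belief(0.5, played=False, arm_type="A", p01=0.2, p11=0.8)
b_play_a = next_belief(0.3, played=True, arm_type="A", p01=0.2, p11=0.8)
b_play_b = next_belief(0.3, played=True, arm_type="B", p01=0.2, p11=0.8)
r = expected_reward(0.25, rho0=0.2, rho1=0.6)
```

A myopic policy under these assumptions would simply play the arm with the largest `expected_reward`; the Whittle index studied in the paper additionally accounts for the effect of a play on future beliefs.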
