Improving multi-armed bandit algorithms in online pricing settings (2018)
DOI: 10.1016/j.ijar.2018.04.006

Cited by 17 publications (16 citation statements); References 16 publications.
“…Contrarily, in this paper we consider the renewal price adjustment problem as a sequential decision process. This is not the first time that pricing problems are modeled as sequential decision making (Cesa-Bianchi et al., 2006; Blum & Hartline, 2005; Trovò et al., 2018). For instance, Cesa-Bianchi et al. (2006) address a problem similar to the one considered in this paper.…”
Section: Related Work
confidence: 90%
“…Cohen et al (2020) developed an online contextual bandit approach for pricing online fashion products with each product defined by a set of features. Trovò et al (2018) applied multi-armed bandit algorithms to online pricing of non-perishable goods in both stationary and non-stationary environments. Several other papers have extended dynamic pricing to the full reinforcement learning problem where an agent must consider the long-term consequences of its actions.…”
Section: Related Work
confidence: 99%
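The statement above describes the paper's core setting: each candidate price is treated as a bandit arm, and the seller learns which price maximizes expected revenue from binary buy/no-buy feedback. As a hedged illustration only (not the paper's actual algorithm), a minimal Thompson-sampling pricing loop over a hypothetical discrete price grid might look like this; the prices, purchase probabilities, and simulated buyer are all assumptions for the sketch:

```python
import random

def thompson_pricing(prices, buy_prob, rounds=10_000, seed=0):
    """Thompson sampling over a discrete set of candidate prices.

    Each price is a bandit arm; a sale at price p yields revenue p.
    A Beta posterior per arm tracks the unknown purchase probability.
    `buy_prob` simulates the (unknown) buyer response per price.
    """
    rng = random.Random(seed)
    successes = [1] * len(prices)  # Beta(1, 1) priors
    failures = [1] * len(prices)
    revenue = 0.0
    for _ in range(rounds):
        # Sample a purchase-probability estimate per arm and post
        # the price with the highest sampled expected revenue.
        sampled = [p * rng.betavariate(s, f)
                   for p, s, f in zip(prices, successes, failures)]
        arm = max(range(len(prices)), key=sampled.__getitem__)
        sold = rng.random() < buy_prob[arm]  # simulated buyer response
        if sold:
            successes[arm] += 1
            revenue += prices[arm]
        else:
            failures[arm] += 1
    return revenue, successes, failures
```

With expected revenues of 4.5, 6.0, and 2.0 for prices 5, 10, and 20 (given purchase probabilities 0.9, 0.6, 0.1), the loop concentrates its posts on the middle price.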
“…The authors propose algorithms that minimize the per-round pseudo-regret over an infinite time horizon. We also mention the work by Trovò, Paladino, Restelli, and Gatti (2018), who provide bandit algorithms for dynamic pricing in non-stationary settings. Finally, the problem of non-stationarity with bounded per-round variation is tackled using contextual bandit techniques by Slivkins (2011), who designs the Contextual Zooming algorithm, and by Luo, Wei, Agarwal, and Langford (2018), who use a variant of the classic EXP4 algorithm.…”
Section: Related Work
confidence: 99%
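A standard way to handle the non-stationary settings mentioned above is to discard stale feedback, e.g. via Sliding-Window UCB (Garivier & Moulines), which estimates each arm's mean from only the most recent pulls. This is a generic sketch of that idea, not the specific algorithms of the cited papers; the reward function and window size are assumptions:

```python
import math
from collections import deque

def sw_ucb(rewards_fn, n_arms, horizon, window=500, c=1.0):
    """Sliding-Window UCB: statistics use only the last `window`
    pulls, so the policy can track rewards that change over time."""
    history = deque()  # (arm, reward) pairs inside the window
    total = 0.0
    for t in range(horizon):
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for arm, r in history:
            counts[arm] += 1
            sums[arm] += r
        def ucb(a):
            if counts[a] == 0:
                return float("inf")  # force a pull of unseen arms
            bonus = c * math.sqrt(math.log(min(t, window) + 1) / counts[a])
            return sums[a] / counts[a] + bonus
        arm = max(range(n_arms), key=ucb)
        r = rewards_fn(t, arm)  # environment may be non-stationary
        total += r
        history.append((arm, r))
        if len(history) > window:
            history.popleft()  # old observations fall out of the window
    return total
```

Because old pulls fall out of the window, the optimistic bonus of a formerly bad arm grows again after a change point, so the policy re-explores and recovers when the best price shifts, unlike a plain UCB whose estimates average over the whole history.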