Offline Reinforcement Learning with On-Policy Q-Function Regularization

Shi, Laixi; Dadashi, Robert; Chi, Yuejie; Castro, Pablo Samuel; Geist, Matthieu

doi:10.1007/978-3-031-43421-1_27

Cited by 2 publications

(1 citation statement)

References 6 publications

“…Lu et al (2022) focus on offline reinforcement learning and insist that data distribution is represented and handled more clearly in offline reinforcement learning. Shi et al (2022) insist on the use of learning a near-optimal policy using history data by the agent with offline or batch reinforcement learning. Wang and Wu (2022) discuss the perspective of research directions of blockchain in the domain of operations research.…”

Section: Related Work (mentioning)

confidence: 99%

Q‐Learning model for selfish miners with optional stopping theorem for honest miners

Jeyasheela Rakkini, Geetha

2023

Int Trans Operational Res

Bitcoin, the most popular cryptocurrency used in the blockchain, has miners join mining pools and get rewarded for the proportion of hash rate they have contributed to the mining pool. This work proposes the prediction of the relativegain of the miners by machine learning and deep learning models, the miners' selection of higher relativegain by the Q‐learning model, and an optional stopping theorem for honest miners in the presence of selfish mining attacks. Relativegain is the ratio of the number of blocks mined by selfish miners in the main canonical chain to the blocks of other miners. A Q‐learning agent with ε‐greedy value iteration, which seeks to increase the relativegain for the selfish miners, that takes into account all the other quintessential parameters, including the hash rate of miners, time warp, the height of the blockchain, the number of times the blockchain was reorganized, and the adjustment of the timestamp of the block, is implemented. Next, the ruin of the honest miners and the optional stopping theorem are analyzed so that the honest miners can quit the mining process before their complete ruin. We obtain a low mean square error of 0.0032 and a mean absolute error of 0.0464 in our deep learning model. Our Q‐learning model exhibits a linearly increasing curve, which denotes the increase in the relativegain caused by the selection of the action of performing the reorganization attack.
