This paper examines the Multi-Armed Bandit (MAB) problem, a fundamental concept in reinforcement learning and probability theory, with a focus on its applications in recommendation systems and dynamic decision-making settings such as pricing and investment. It begins by outlining the central dilemma of the MAB problem: balancing exploration and exploitation under a limited number of trials. The study concentrates on Upper Confidence Bound (UCB) policies, in particular UCB-tuned and Asymptotically Optimal UCB, which are noted for striking an effective balance between exploration and exploitation. The main contribution of this research is the enhancement of these UCB policies via a weighted average method, yielding the WA-UCB-tuned and WA Asymptotically Optimal UCB algorithms. These enhanced variants are rigorously compared with the traditional UCB1, UCB-tuned, and Asymptotically Optimal UCB policies across MAB models with varying numbers of arms. The paper provides a thorough introduction to the MAB problem and the relevant UCB policies, describes the methodology behind the weighted average optimization, and presents extensive experimental analysis and evaluation of the findings. The results show marked improvements in algorithmic performance, suggesting meaningful advances for recommendation systems and other applications of the MAB problem.
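For context on the baseline policies compared here, the standard UCB1 index for arm $i$ at round $t$ is $\bar{x}_i + \sqrt{2 \ln t / n_i}$, where $\bar{x}_i$ is the empirical mean reward and $n_i$ the number of pulls of that arm. The sketch below illustrates this baseline on simulated Bernoulli arms; the function names (`ucb1_select`, `run_bandit`) and the simulation setup are illustrative assumptions, and the paper's weighted-average (WA) variants are not reproduced here since their details are given later in the text.

```python
import math
import random

def ucb1_select(counts, means, t):
    """Pick the arm with the largest UCB1 index: mean + sqrt(2*ln(t)/n_i).
    Arms that have never been pulled are tried first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    indices = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(indices)), key=lambda i: indices[i])

def run_bandit(arm_probs, horizon=10_000, seed=0):
    """Simulate Bernoulli-reward arms and run UCB1 for `horizon` rounds."""
    rng = random.Random(seed)
    k = len(arm_probs)
    counts = [0] * k    # pulls per arm (n_i)
    means = [0.0] * k   # empirical mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        arm = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total += reward
    return total, counts

if __name__ == "__main__":
    total_reward, pulls = run_bandit([0.2, 0.5, 0.7])
    print("total reward:", total_reward, "pulls per arm:", pulls)
```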