One of the key challenges for multi-core processors in the nano-CMOS era is dealing with increased temperatures. It is imperative that peak temperatures are reduced and that heat is spread as evenly across the chip as possible, to avoid mutual heating and high thermal gradients between processor cores. Approaches have emerged that share a global power budget among multiple cores to meet these objectives. However, while these approaches act proactively by distributing power across the chip before thermal problems arise, changes to their strategies remain reactive, triggered only once a temperature threshold is reached. Our approach uses reinforcement learning, based on previously recorded observations, to dynamically change what we call power trading strategies before thermal thresholds are hit. Through learning, our hierarchical approach is also able to distribute multiple power budgets at once, thereby making power trading more effective. It decreases peak temperatures by around 4% compared to a fully distributed approach (which can be critical at near-threshold temperatures in terms of transient errors) while also decreasing the number of deadline misses by a factor of 7. Our technique has been verified using a thermal camera.