In a cognitive wireless mesh network, licensed users (primary users, PUs) may rent surplus spectrum to unlicensed users (secondary users, SUs) to earn revenue. In this spectrum-sharing paradigm, the key objective of the PUs is to maximize revenue, while that of the SUs is to meet their own requirements. This paper develops and implements a reinforcement learning (RL) model that captures these conflicting objectives. The objective function is defined as the net revenue that PUs gain from renting out part of their spectrum. RL is used to extract the optimal control policy, which maximizes the PUs’ profit over time. PUs use the extracted policy to manage spectrum renting to SUs and to adapt to changing network conditions. Performance evaluation of the proposed spectrum trading approach shows that it finds the optimal size and price of spectrum for each PU under different conditions. Moreover, the approach provides a framework for studying, synthesizing, and optimizing other schemes. A further contribution is a new distributed algorithm for managing spectrum sharing among PUs. In this cooperative scheme, PUs exchange channels dynamically based on the availability of their neighbors’ idle channels, with the aim of maximizing total revenue and utilizing spectrum efficiently. Compared with the poverty-line heuristic, which does not consider the availability of unused spectrum, our scheme utilizes spectrum more efficiently.
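To make the idea of learning a size-and-price policy concrete, the following is a minimal sketch of tabular Q-learning for a single PU. It is not the model developed in this paper: the demand levels, price grid, per-channel cost, and the names `net_revenue`, `train`, and `greedy_policy` are all hypothetical, chosen only for illustration. The sketch also assumes that SU demand drifts at random, independently of the PU's action, so it reduces to learning a best response per demand level; a realistic model would couple demand to the PU's decisions.

```python
import random

# Hypothetical toy model; none of these numbers come from the paper.
SIZES = [1, 2, 3]            # number of channels a PU may offer for rent
PRICES = [1.0, 2.0, 3.0]     # price charged per rented channel
BASE_DEMAND = [2, 4, 6]      # SU demand at low / medium / high load (the state)
COST_PER_CHANNEL = 0.5       # assumed cost to the PU of giving up one channel
ACTIONS = [(size, price) for size in SIZES for price in PRICES]


def net_revenue(demand_level, size, price):
    """Net revenue of one renting round: income from SUs minus the PU's cost."""
    # Assumed demand curve: SUs request fewer channels as the price rises.
    requested = max(0, BASE_DEMAND[demand_level] - int(price))
    rented = min(size, requested)
    return price * rented - COST_PER_CHANNEL * size


def train(steps=300_000, alpha=0.005, gamma=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    n_states = len(BASE_DEMAND)
    q = [[0.0] * len(ACTIONS) for _ in range(n_states)]
    state = rng.randrange(n_states)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(ACTIONS))          # explore
        else:
            action = max(range(len(ACTIONS)), key=lambda a: q[state][a])  # exploit
        reward = net_revenue(state, *ACTIONS[action])
        next_state = rng.randrange(n_states)  # demand drifts at random in this toy
        # Standard Q-learning update toward the bootstrapped target.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


def greedy_policy(q):
    """Return the (size, price) pair with the highest Q-value for each state."""
    return [ACTIONS[max(range(len(ACTIONS)), key=lambda a: row[a])] for row in q]


if __name__ == "__main__":
    for level, (size, price) in enumerate(greedy_policy(train())):
        print(f"demand level {level}: offer {size} channel(s) at price {price}")
```

Under these assumptions, the learned policy should tend to offer more channels at higher prices as SU demand rises. The reward used here plays the role of the net-revenue objective described above, but its functional form is an assumption of this sketch.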