Unmanned aerial vehicles (UAVs) are receiving increasing attention due to their wide range of applications. Flying ad-hoc networks (FANETs) enable communication among UAVs, but their high mobility and rapidly changing topology make stable routing difficult to achieve. Cluster-based FANETs provide a self-organizing approach to routing; however, clustering is a decentralized process driven by multiple utilities, and the weight given to each utility can strongly affect performance. In this work, we propose a centralized agent, placed at a remote ground station, that adjusts the role of each utility in cluster-head arbitration. Our approach uses six utilities: node centrality, residual energy, link holding time, velocity similarity, buffer occupancy, and diversity. The developed reinforcement learning (RL) clustering approach enables routing over multiple planes, where each plane represents a network cell with a different node coverage radius (Femto, Pico, and Micro planes). We propose a novel reward formulation for effective agent learning that includes stability indicators based on role changes, energy consumption, and confirmation messages relayed through the selected node. For evaluation, we implemented three types of RL agents, namely Q-learning, DQN, and DDPG, and compared them against a random agent. The results show that our RL-based clustering is superior in terms of stability, energy consumption, and most network metrics.
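Since the abstract describes the arbitration and reward design only at a high level, the sketch below illustrates one plausible structure: a weighted fitness score over the six named utilities, with weights set by the centralized agent, and a reward combining the three stated indicators. This is a minimal sketch, not the paper's actual formulation; all names, signatures, and coefficients (`fitness`, `elect_cluster_head`, `reward`, the weight vector `w`, and the coefficients `alpha`, `beta`, `gamma`) are hypothetical.

```python
import numpy as np

# Hypothetical sketch: the exact formulas are not given in the abstract,
# so this only illustrates the assumed general structure.

UTILITIES = ["centrality", "residual_energy", "link_holding_time",
             "velocity_similarity", "buffer_occupancy", "diversity"]


def fitness(utilities: np.ndarray, w: np.ndarray) -> float:
    """Weighted cluster-head fitness score for one node.

    utilities: the six normalized utility values, each in [0, 1].
    w:         per-utility weights adjusted by the centralized RL agent.
    """
    return float(np.dot(w, utilities))


def elect_cluster_head(node_utilities: np.ndarray, w: np.ndarray) -> int:
    """Pick the node with the highest weighted fitness as cluster head.

    node_utilities: matrix of shape (num_nodes, 6).
    """
    scores = node_utilities @ w
    return int(np.argmax(scores))


def reward(role_changed: bool, energy_used: float, ack_received: bool,
           alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Aggregate the three indicators named in the abstract: penalize
    cluster-head role changes (stability) and energy consumption, and
    reward confirmation messages relayed through the selected node.
    The linear combination and its coefficients are assumptions.
    """
    return (-alpha * float(role_changed)
            - beta * energy_used
            + gamma * float(ack_received))


# Toy usage: 5 nodes x 6 utilities, uniform initial weights.
rng = np.random.default_rng(0)
node_utilities = rng.random((5, len(UTILITIES)))
w = np.ones(len(UTILITIES)) / len(UTILITIES)
print("elected cluster head:", elect_cluster_head(node_utilities, w))
print("example reward:", reward(role_changed=False, energy_used=0.2,
                                ack_received=True))
```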