Owing to their high maneuverability, flexible deployment, and line-of-sight (LoS) transmission, unmanned aerial vehicles (UAVs) are a promising option for reliable device-to-device (D2D) communication when obstacles in the signal propagation path prevent a direct link between source and destination devices. In this paper, we propose a UAVs-supported self-organized device-to-device (USSD2D) network in which multiple UAVs are employed as aerial relays. We develop a novel optimization framework that maximizes the total instantaneous transmission rate of the network by jointly optimizing the UAVs’ deployment locations, the device association, and the UAVs’ channel selection, while ensuring that every device satisfies a given signal-to-interference-plus-noise ratio (SINR) constraint. Because this joint optimization problem is nonconvex and combinatorial, we adopt a reinforcement learning (RL)-based solution methodology that effectively decouples it into three individual optimization problems. The formulated problem is transformed into a Markov decision process (MDP) in which the UAVs learn the system parameters from the current state and the corresponding action, aiming to maximize the reward generated under the current policy. Finally, we develop a low-complexity iterative algorithm based on state-action-reward-state-action (SARSA) that updates the current policy for randomly deployed device pairs and achieves a good tradeoff between computational complexity and optimality. Numerical results validate the analysis and provide various insights into the optimal deployment of UAVs. The proposed methodology improves the total instantaneous transmission rate of the network by 75.37%, 52.08%, and 14.77% compared with the RS-FORD, ES-FIRD, and AOIV schemes, respectively.
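
As a rough illustration of the SARSA policy update mentioned above, the following minimal sketch shows the standard tabular on-policy rule Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]. It is not the paper's implementation: the state and action encodings, the ε-greedy exploration rule, the function names, and the values of the learning rate `alpha` and discount factor `gamma` are all assumptions made for illustration, since the abstract does not specify them.

```python
import random


def epsilon_greedy(q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one.

    Assumption: epsilon-greedy exploration; the abstract does not say
    which exploration rule the UAV agents use.
    """
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [q.get((state, a), 0.0) for a in range(n_actions)]
    return values.index(max(values))


def sarsa_update(q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one tabular SARSA step and return the new Q(s, a).

    q is a dict keyed by (state, action); missing entries count as 0.
    Hypothetical reading: a state could encode a UAV's position and
    channel, and the reward could be the achieved transmission rate.
    """
    current = q.get((s, a), 0.0)
    # On-policy target: uses the action a_next actually chosen in s_next.
    target = reward + gamma * q.get((s_next, a_next), 0.0)
    q[(s, a)] = current + alpha * (target - current)
    return q[(s, a)]
```

In a training loop, each agent would call `epsilon_greedy` to choose `a_next` in the new state and then pass that same action to `sarsa_update`, which is what distinguishes SARSA from off-policy Q-learning.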