Recently, unmanned aerial vehicle (UAV)-based communications gained a lot of attention due to their numerous applications, especially in rescue services in post-disaster areas where the terrestrial network is wholly malfunctioned. Multiple access/gateway UAVs are distributed to fully cover the post-disaster area as flying base stations to provide communication coverage, collect valuable information, disseminate essential instructions, etc. The access UAVs after gathering/broadcasting the necessary information should select and fly towards one of the surrounding gateways for relaying their information. In this paper, the gateway UAV selection problem is addressed. The main aim is to maximize the long-term average data rates of the UAVs relays while minimizing the flights’ battery cost, where millimeter wave links, i.e., using 30~300 GHz band, employing antenna beamforming, are used for backhauling. A tool of machine learning (ML) is exploited to address the problem as a budget-constrained multi-player multi-armed bandit (MAB) problem. In this setup, access UAVs act as the players, and the arms are the gateway UAVs, while the rewards are the average data rates of the constructed relays constrained by the battery cost of the access UAV flights. In this decentralized setting, where information is neither prior available nor exchanged among UAVs, a selfish and concurrent multi-player MAB strategy is suggested. Towards this end, three battery-aware MAB (BA-MAB) algorithms, namely upper confidence bound (UCB), Thompson sampling (TS), and the exponential weight algorithm for exploration and exploitation (EXP3), are proposed to realize gateways selection efficiently. The proposed BA-MAB-based gateway UAV selection algorithms show superior performance over approaches based on near and random selections in terms of total system rate and energy efficiency.