Best arm identification (or pure exploration) in multi-armed bandits is a fundamental problem in machine learning. In this paper we study the distributed version of this problem, where multiple agents want to learn the best arm collaboratively. We want to quantify the power of collaboration under limited interaction (or communication steps), as interaction is expensive in many settings. We measure the running time of a distributed algorithm by its speedup over the best centralized algorithm, where there is only one agent. We give almost tight round-speedup tradeoffs for this problem; along the way we develop several new techniques for proving lower bounds on the number of communication steps under time or confidence constraints.

* Chao Tao is supported in part by NSF IIS-1633215. Qin Zhang is supported in part by NSF IIS-1633215 and CCF-1844234.

The computation proceeds in rounds. In each round each agent pulls a (multi)set of arms without communication. At any time step, based on the indices and outcomes of all its previous pulls, all the messages it has received, and the randomness of the algorithm (if any), each agent that is not in the wait mode takes one of the following actions: (1) makes the next pull; (2) requests a communication step and enters the wait mode; (3) terminates and outputs the answer. A communication step starts once all non-terminated agents are in the wait mode. After a communication step, all non-terminated agents exit the wait mode and start a new round. During each communication step, each agent can broadcast a message to every other agent. While we do not restrict the size of the message, in practice it will not be too large: the information of all pull outcomes of an agent can be described by an array of size at most n, with each coordinate storing a pair (c_i, sum_i), where c_i is the number of pulls on the i-th arm and sum_i is the sum of the rewards of those c_i pulls. Once terminated, an agent takes no further actions.
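As a concrete illustration, the round structure above can be sketched in a few lines of Python. This is only a toy simulation under assumed parameters (Bernoulli arms, a fixed number of uniformly chosen pulls per agent per round); the function name and all parameters are hypothetical, but the broadcast message follows the per-arm (c_i, sum_i) array format described above, and the communication step merges all agents' statistics:

```python
import random

def simulate_collaborative_rounds(n_arms=4, n_agents=3, n_rounds=2,
                                  pulls_per_agent_per_round=10, seed=0):
    """Toy simulation of the round-based collaborative model (hypothetical
    parameters): each round, every agent pulls arms locally with no
    communication; a communication step then broadcasts each agent's
    per-arm (c_i, sum_i) statistics, which all agents merge."""
    rng = random.Random(seed)
    means = [rng.random() for _ in range(n_arms)]  # hidden Bernoulli means

    # merged[i] = (c_i, sum_i): pull count and reward sum over all agents
    merged = [(0, 0.0) for _ in range(n_arms)]
    round_costs = []  # t_r = max #pulls by any single agent in round r

    for _ in range(n_rounds):
        messages = []
        for _agent in range(n_agents):
            local = [(0, 0.0) for _ in range(n_arms)]
            for _ in range(pulls_per_agent_per_round):
                i = rng.randrange(n_arms)  # choose the next arm to pull
                reward = 1.0 if rng.random() < means[i] else 0.0
                c, s = local[i]
                local[i] = (c + 1, s + reward)
            messages.append(local)
        # communication step: every agent broadcasts its array and merges all
        for msg in messages:
            merged = [(c0 + c1, s0 + s1)
                      for (c0, s0), (c1, s1) in zip(merged, msg)]
        # here every agent makes the same number of pulls, so t_r is that number
        round_costs.append(pulls_per_agent_per_round)

    T = sum(round_costs)  # running time: sum over rounds of max pulls
    best = max(range(n_arms),
               key=lambda i: merged[i][1] / max(merged[i][0], 1))
    return best, T, merged
```

In this simplified sketch all agents pull the same number of arms per round; in a real algorithm the per-round pull counts differ across agents, and t_r is the maximum among them.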
The algorithm terminates when all agents terminate. When the algorithm terminates, all agents should agree on the same best arm; otherwise we say the algorithm fails. The number of rounds of computation, denoted by R, is the number of communication steps plus one. Our goal in the collaborative learning model is to minimize the number of rounds R and the running time T = Σ_{r∈[R]} t_r, where t_r is the maximum number of pulls made among the K agents in round r. The motivation for minimizing R is that initiating a communication step always incurs a large time overhead (due to network bandwidth, latency, and protocol handshaking) and energy consumption (e.g., think of robots exploring in the deep sea or on Mars). Round-efficiency is one of the major concerns in all parallel/distributed computational models, such as the BSP model [42] and MapReduce [16]. The total cost of the algorithm is a weighted sum of R and T, where the coefficients depend on the concrete application. We are thus interested in the best round-time tradeoffs for collaborative best arm identification.

Speedu...