We consider the problem of computing a relational query $q$ on a large input database of size $n$, using a large number $p$ of servers. The computation is performed in rounds, and each server can receive only $O(n/p^{1-\varepsilon})$ bits of data, where $\varepsilon \in [0, 1]$ is a parameter that controls replication. We examine how many global communication steps are needed to compute $q$. We establish both lower and upper bounds, in two settings. For a single round of communication, we give lower bounds in the strongest possible model, where arbitrary bits may be exchanged; we show that any algorithm requires $\varepsilon \geq 1 - 1/\tau^*$, where $\tau^*$ is the fractional vertex cover number of the hypergraph of $q$. We also give an algorithm that matches the lower bound for a specific class of databases. For multiple rounds of communication, we present lower bounds in a model where routing decisions for a tuple are tuple-based. We show that for the class of tree-like queries there exists a tradeoff between the number of rounds and the space exponent $\varepsilon$. The lower bounds for multiple rounds are the first of their kind. Our results also imply that transitive closure cannot be computed in $O(1)$ rounds of communication.

... than main memory access. In addition, any data reshuffling requires a global synchronization of all servers, which also comes at significant cost; for example, everyone needs to wait for the slowest server and, worse, in the case of a straggler or a local node failure, everyone must wait for the full recovery. Thus, the dominating complexity parameters in big data query processing are the number of communication steps and the amount of data being exchanged.

MapReduce-related models. Several computation models have been proposed in order to understand the power of MapReduce and related massively parallel programming methods [9,16,17,1].
These all identify the number of communication steps/rounds as a main complexity parameter, but differ in their treatment of the communication.

The first of these models was the MUD (Massive, Unordered, Distributed) model of Feldman et al. [9]. It takes as input a sequence of elements and applies a binary merge operation repeatedly until it obtains a final result, similarly to a User Defined Aggregate in database systems. The paper compares MUD with streaming algorithms: a streaming algorithm can trivially simulate MUD, and the converse is also possible if the merge operators are computationally powerful (beyond PTIME).

Karloff et al. [16] define MRC, a class of multi-round algorithms based on using the MapReduce primitive as the sole building block, and fixing specific parameters for balanced processing. The number of processors $p$ is $\Theta(N^{1-\varepsilon})$, and each can exchange MapReduce outputs expressible in $\Theta(N^{1-\varepsilon})$ bits per step, resulting in $\Theta(N^{2-2\varepsilon})$ total storage among the processors on a problem of size $N$. Their focus was algorithmic, showing simulations of other parallel models by MRC, as well as the power of two-round algorithms for specific problems.

Lower bounds for the single-round MapReduce model are first discussed by Afrati et al....