Massively parallel join algorithms have received much attention in recent years, but most prior work has focused on worst-case optimal algorithms. However, the worst-case optimality of these join algorithms relies on hard instances with very large output sizes, which rarely appear in practice. A stronger notion of optimality is output-optimality, which requires an algorithm to be optimal within the class of all instances sharing the same input and output size. An even stronger notion is instance-optimality, i.e., the algorithm is optimal on every single instance, but this may not always be achievable. In the traditional RAM model of computation, the classical Yannakakis algorithm is instance-optimal on any acyclic join. But in the massively parallel computation (MPC) model, the situation becomes much more complicated. We first show that for the class of r-hierarchical joins, instance-optimality can still be achieved in the MPC model. Then, we give a new MPC algorithm for an arbitrary acyclic join with load O(IN/p + √(IN·OUT)/p), where IN and OUT are the input and output sizes of the join, and p is the number of servers in the MPC model. This improves the MPC version of the Yannakakis algorithm by an O(√(OUT/IN)) factor. Furthermore, we show that this is output-optimal when OUT = O(p · IN), for every acyclic but non-r-hierarchical join. Finally, we give the first output-sensitive lower bound for the triangle join in the MPC model, showing that it is inherently more difficult than acyclic joins.

In the MPC model, the computation proceeds in rounds: in each round, each server sends messages to other servers, receives messages from other servers, and then does some local computation. The complexity of the algorithm is measured by the number of rounds and the load, denoted as L, which is the maximum message size received by any server in any round. Without any constraint on the load, all problems can be solved trivially in one round by simply sending all data to one server.
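To make the improvement over the MPC version of the Yannakakis algorithm concrete, the following sketch compares the two load bounds numerically (illustrative only: the instance sizes IN, OUT and server count p are hypothetical, and the constants hidden in the O-notation are dropped):

```python
import math

def yannakakis_load(IN, OUT, p):
    # MPC version of the Yannakakis algorithm: load O(IN/p + OUT/p)
    return IN / p + OUT / p

def new_algorithm_load(IN, OUT, p):
    # The new acyclic-join algorithm: load O(IN/p + sqrt(IN*OUT)/p)
    return IN / p + math.sqrt(IN * OUT) / p

# Hypothetical instance: a join with a much larger output than input
IN, OUT, p = 10**6, 10**9, 100

print(yannakakis_load(IN, OUT, p))      # dominated by the OUT/p term
print(new_algorithm_load(IN, OUT, p))   # dominated by sqrt(IN*OUT)/p

# When OUT >> IN, the ratio of the two bounds approaches sqrt(OUT/IN)
print(yannakakis_load(IN, OUT, p) / new_algorithm_load(IN, OUT, p))
```

On this instance √(OUT/IN) ≈ 31.6, so the dominant term of the load shrinks by roughly that factor, matching the O(√(OUT/IN)) improvement stated above.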
Initial efforts were mostly spent on what can be done in a single round of computation [3,7,8,24,26], but recently, more interest has been given to multi-round (but still constant-round) algorithms [2,22,24], since new main-memory-based systems, such as Spark and Flink, have much lower overhead per round than previous generations like Hadoop.

The MPC model can be considered a simplified version of the BSP model [32], but it has enjoyed more popularity in recent years. This is mostly because the BSP model takes too many measures into consideration, such as communication cost, local computation time, memory consumption, etc. The MPC model unifies all these costs in one parameter L, which makes the model much simpler. Meanwhile, although L is defined as the maximum incoming message size of a server, it is also closely related to the local computation time and memory consumption, which are both increasing functions of L. Thus, L serves as a good surrogate for these other cost measures. This is also why the MPC model does not limit the outgoing message size of a server, which is less relevant to the other costs.

All our algorithms work under the mild assumption IN ≥ p^{1+ε}, where ε > 0 is any small constant.
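The definition of the load L can be illustrated with a toy computation (a minimal sketch; the servers, rounds, and message sizes below are hypothetical and stand in for an actual MPC execution):

```python
# Toy illustration of the load L in the MPC model: L is the maximum
# total message size received by any single server in any single round.
# Each round is given as a list of (sender, receiver, size) triples.

def load(rounds, p):
    L = 0
    for messages in rounds:
        received = [0] * p                 # incoming bytes per server this round
        for sender, receiver, size in messages:
            received[receiver] += size
        L = max(L, max(received))          # worst server in the worst round
    return L

# Two hypothetical rounds on p = 3 servers
rounds = [
    [(0, 1, 5), (2, 1, 7), (0, 2, 3)],    # server 1 receives 5 + 7 = 12
    [(1, 0, 4), (2, 0, 4)],               # server 0 receives 4 + 4 = 8
]
print(load(rounds, p=3))                  # -> 12
```

Note that only incoming traffic is counted: consistent with the discussion above, a server may send arbitrarily large messages without affecting L.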