Mutation testing is considered one of the most powerful testing methods. It operates by asking testers to design tests that detect a set of mutants, which are purpose-made injected defects. Clearly, the strength of the method depends heavily on the mutants used. This dependence raises concerns about the mutation testing practice implemented by existing tools, since implementation inadequacies can lead to weak results. In this paper, we cross-evaluate three popular mutation testing tools for Java, namely MUJAVA, MAJOR and PIT, with respect to their effectiveness. We perform an empirical study of 3,324 manually analysed mutants from real-world projects and find large differences in the tools' effectiveness, ranging from 76% to 88%, with MUJAVA achieving the best results. We also demonstrate that no tool subsumes the others and provide practical recommendations on how to strengthen each of the studied tools. Finally, our analysis shows that 11%, 12% and 7% of the mutants generated by MUJAVA, MAJOR and PIT, respectively, are equivalent.