The rapid advancement of protein generative models necessitates robust and principled methods for their evaluation and comparison. As increasingly complex models continue to emerge, it is crucial that the metrics used to assess them are well understood and reliable. In this work, we conduct a systematic investigation of commonly used metrics for evaluating protein generative models, focusing on quality, diversity, and distributional similarity. We examine how these metrics behave under a range of conditions, including controlled synthetic perturbations and outputs from real-world generative models. Our analysis explores different design choices, parameter settings, and underlying representation models, revealing how these factors influence metric performance. We identify several challenges in applying these metrics, including sample-size dependencies, sensitivity to data distribution shifts, and trade-offs in computational efficiency. By testing metrics both on synthetic datasets with controlled properties and on outputs from state-of-the-art protein generators, we provide insights into each metric's strengths, limitations, and practical applicability. Based on our findings, we offer a set of practical recommendations for evaluating protein generative models, aiming to foster more robust and meaningful evaluation practices in the field of protein design.