Machine learning (ML) is revolutionizing the study of ecology and evolution, but the performance of models (and their evaluation) depends on the quality of the training and validation data. Currently, we have standard metrics for evaluating model performance (e.g., precision, recall, F1), but these to some extent overlook the ultimate aim of addressing the specific research question to which the model will be applied. Because improving performance metrics has diminishing returns, particularly when data are inherently noisy, biologists are often faced with the conundrum of investing more time in maximising performance metrics at the expense of doing the actual research. This leads to the question: how much noise can we accept in our ML models?

Here, we start by describing an under-reported source of noise that can cause performance metrics to underestimate true model performance. Specifically, ambiguity between categories or mistakes in the labelling of the validation data produce hard ceilings that limit performance metric scores. Because this source of error is common in biological systems, many models could be performing better than their metrics suggest.

Next, we argue and show that imperfect models (e.g., those with low F1 scores) can still be usable. We first propose a simulation framework to evaluate the robustness of a model for hypothesis testing. Second, we show how to determine the utility of models by supplementing existing performance metrics with ‘biological validations’ that involve applying ML models to unlabelled data in different ecological contexts for which we can anticipate the outcome.

Together, our simulations and case study show that effect sizes and expected biological patterns can be detected even when performance metrics are relatively low (e.g., F1 scores between 60% and 70%). In doing so, we provide a roadmap for validating ML models that is tailored to research in ecology and evolutionary biology.
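As a rough illustration of the ceiling effect described above, the following sketch (a minimal example, not the paper’s own analysis; it assumes Python with NumPy and scikit-learn, a balanced binary task, and a hypothetical classifier that is otherwise perfect) shows how mislabelled validation data alone cap the F1 score a model can appear to achieve.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

n = 10_000          # size of the validation set
label_error = 0.15  # assumed fraction of validation labels that are wrong or ambiguous

true_labels = rng.integers(0, 2, size=n)  # ground truth (unknowable in practice)
predictions = true_labels.copy()          # a hypothetical *perfect* classifier

# Corrupt the validation labels to mimic annotation mistakes / ambiguous categories
flipped = rng.random(n) < label_error
observed_labels = np.where(flipped, 1 - true_labels, true_labels)

# Even a perfect model cannot score above this ceiling when judged against noisy labels
ceiling = f1_score(observed_labels, predictions)
print(f"Apparent F1 of a perfect model with {label_error:.0%} label noise: {ceiling:.2f}")
```

With 15% symmetric label noise the apparent F1 settles near 0.85 even though the model makes no mistakes; the exact ceiling depends on class balance and on how the labelling errors are distributed.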
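Similarly, the kind of simulation framework mentioned above can be sketched in a few lines. The example below is a simplified, hypothetical version under stated assumptions (Python with NumPy and SciPy, a binary behaviour scored per individual, a classifier summarised only by its recall and precision, and equal numbers of positive and negative events when converting precision into a false-positive rate); it asks how often a true between-group difference would still be detected once classification noise is added.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def detection_power(true_rates=(0.25, 0.50), n_per_group=300,
                    recall=0.65, precision=0.70, n_reps=1000, alpha=0.05):
    """Fraction of simulated studies in which a true difference in the rate of a
    behaviour between two groups is detected when the behaviour is scored by an
    imperfect classifier (summarised by its recall and precision)."""
    # False-positive rate implied by the assumed precision and recall
    # (rough approximation assuming equal numbers of positive and negative events)
    fpr = recall * (1 - precision) / precision
    detected = 0
    for _ in range(n_reps):
        counts = []
        for rate in true_rates:
            present = rng.random(n_per_group) < rate             # true occurrences
            scored = np.where(present,
                              rng.random(n_per_group) < recall,  # hits
                              rng.random(n_per_group) < fpr)     # false alarms
            counts.append(int(scored.sum()))
        table = [[counts[0], n_per_group - counts[0]],
                 [counts[1], n_per_group - counts[1]]]
        _, p = stats.fisher_exact(table)
        detected += p < alpha
    return detected / n_reps

print("Power with a perfect classifier:   ", detection_power(recall=1.0, precision=1.0))
print("Power with the imperfect classifier:", detection_power())
```

Repeating such simulations across plausible effect sizes and metric values is one way to judge whether a model with, say, an F1 of 0.65 is already good enough for the hypothesis at hand.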