Abstract. Watershed scale models simulating hydrology and water quality have advanced rapidly in sophistication, process 5 representation, flexibility in model structure, and input data. Given the importance of these models to support decision-making for a wide range of environmental issues, the hydrology community is compelled to improve the metrics used to evaluate model performance. More targeted and comprehensive metrics will facilitate better and more efficient calibration and will help demonstrate that the model is useful for the intended purpose. Here we introduce a suite of new tools for model evaluation,packaged as an open-source Hydrologic Model Evaluation (HydroME) Toolbox. Specifically, we demonstrate the use of box 10 plots to illustrate the full distribution of common model performance metrics, such as R 2 , use of Euclidian distance, empirical Quantile-Quantile (Q-Q) plots and flow duration curves as simple metrics to identify and localize errors in model simulations.Further, we demonstrate the use of magnitude squared coherence to compare the frequency content between observed and modeled streamflow and wavelet coherence to localize frequency mismatches in time. We provide a rationale for a hierarchical selection of parameters to adjust during calibration and recommend that modelers progress from parameters with the most 15 uncertainty to the least uncertainty, namely starting with pure calibration parameters, followed by derived parameters, and finally measured parameters. We apply these techniques in the calibration and evaluation of models of two watersheds, the Le Sueur River Basin (2880 km 2 ) and Root River Basin (4300 km 2 ) in southern Minnesota, USA.