Abstract. Calibration is an essential step for improving the accuracy of simulations generated using hydrologic models, and a key modeler decision is the selection of the performance metric to be optimized. It has been common to use squared-error performance metrics, or normalized variants such as Nash-Sutcliffe Efficiency (NSE), based on the idea that their squared-error nature will emphasize the estimation of high flows. However, we find that NSE-based model calibrations actually result in poor reproduction of high flow events, such as the annual peak flows that are used for flood frequency estimation. Using three different types of performance metrics, we calibrate two hydrological models, the "Variable Infiltration Capacity" model (VIC) and the "mesoscale Hydrologic Model" (mHM), and evaluate their ability to simulate high flow events for 492 basins throughout the contiguous United States. The metrics investigated are (1) NSE, (2) Kling-Gupta Efficiency (KGE) and variants, and (3) Annual Peak Flow Bias (APFB), where the latter is an application-specific "hydrologic signature" metric that focuses on annual peak flows. As expected, the application-specific APFB metric produces the best annual peak flow estimates; however, performance on other high-flow-related metrics is poor. In contrast, the use of NSE results in annual peak flow estimates that are more than 20% worse, primarily due to the tendency of NSE to result in underestimation of observed flow variability. Meanwhile, the use of KGE results in annual peak flow estimates that are better than those from NSE, with only a slight degradation in performance with respect to other related metrics, particularly when a non-standard weighting of the components of KGE is used. Overall, this work highlights the need for a fuller understanding of performance metric behavior and design in relation to the desired goals of model calibration.
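
To make the three metric families concrete, the following is a minimal Python sketch, not the authors' implementation. NSE and KGE follow their standard definitions. The `weights` argument is a hypothetical stand-in for the non-standard weighting of the KGE components, and `apfb` assumes the metric is the absolute relative bias of the mean annual peak flow, which the abstract does not spell out. All function names and the toy data are illustrative.

```python
import math
from statistics import fmean, pstdev


def nse(sim, obs):
    """Nash-Sutcliffe Efficiency: 1 - SSE / (sum of squared deviations of obs)."""
    mean_obs = fmean(obs)
    sse = sum((s - o) ** 2 for s, o in zip(sim, obs))
    ss_obs = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / ss_obs


def kge(sim, obs, weights=(1.0, 1.0, 1.0)):
    """Kling-Gupta Efficiency with optional component weights.

    weights = (w_r, w_alpha, w_beta) is an assumed parameterization;
    (1, 1, 1) gives the standard KGE.
    """
    mu_s, mu_o = fmean(sim), fmean(obs)
    sd_s, sd_o = pstdev(sim), pstdev(obs)
    cov = fmean([(s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)])
    r = cov / (sd_s * sd_o)   # linear correlation
    alpha = sd_s / sd_o       # variability ratio
    beta = mu_s / mu_o        # mean (bias) ratio
    w_r, w_a, w_b = weights
    return 1.0 - math.sqrt(
        (w_r * (r - 1.0)) ** 2
        + (w_a * (alpha - 1.0)) ** 2
        + (w_b * (beta - 1.0)) ** 2
    )


def apfb(sim_peaks, obs_peaks):
    """Assumed form of Annual Peak Flow Bias: |mean(sim peaks) / mean(obs peaks) - 1|."""
    return abs(fmean(sim_peaks) / fmean(obs_peaks) - 1.0)


# Toy series: the simulation matches the observed mean and timing
# but has only half of the observed variability.
obs = [1.0, 2.0, 8.0, 3.0, 1.0, 12.0, 4.0, 1.0]
mean_obs = fmean(obs)
sim = [mean_obs + 0.5 * (o - mean_obs) for o in obs]

print("NSE:", nse(sim, obs))
print("KGE:", kge(sim, obs))
print("KGE with doubled variability weight:", kge(sim, obs, weights=(1.0, 2.0, 1.0)))
```

In this toy case the correlation is perfect and the mean is unbiased, so the only error is the halved variability: NSE is 0.75 while standard KGE is 0.5. KGE therefore penalizes the damped peaks more strongly, and raising the weight on the variability term increases that penalty further.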