Selecting an appropriate model for a catchment is challenging, and choosing an inappropriate model can yield unreliable results. The Automatic Model Structure Identification (AMSI) method simultaneously calibrates model structural choices and model parameters, which reduces the workload of comparing different models. In this study, we benchmark AMSI's capabilities in two ways, using 12 hydro‐climatically diverse Model Parameter Estimation Experiment catchments. First, we calibrate parameter values for 7,488 different model structures and test AMSI's ability to find the best‐performing models in this set. Second, we compare the performance of these 7,488 models and of AMSI's selection to the performance of 45 commonly used, structurally more diverse conceptual models. In both cases, we quantify model accuracy (through the Kling‐Gupta Efficiency) and model adequacy (through various hydrologic signatures). AMSI effectively identifies high‐accuracy models among the 7,488 options, with Kling‐Gupta Efficiency scores comparable to the best among the 45 models. However, model adequacy remains poor for the accurate models, regardless of the selection method. In nine of the tested catchments, none of the most accurate models replicates the observed signatures with errors below 50%; in the remaining three catchments, only a handful of models do so. This paper thus provides strong empirical evidence that relying on aggregated efficiency metrics is unlikely to result in hydrologically adequate models, no matter how the models themselves are selected. Nevertheless, AMSI effectively searches the model hypothesis space it is given. Combined with an improved calibration approach, it can therefore offer new ways to address the challenges of model structure selection.
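
As an illustration of the two evaluation measures named above, the following Python sketch computes the Kling‐Gupta Efficiency and a relative error for a single hydrologic signature. It is a minimal sketch under stated assumptions, not the study's implementation: the function names are hypothetical, the KGE is written in its widely used 2009 formulation (correlation, variability ratio, bias ratio), and the signature error is assumed to be the absolute relative deviation from the observed value. The exact formulations used in the study are not given here.

```python
import math


def kling_gupta_efficiency(sim, obs):
    """KGE in its 2009 form: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2).

    r     : linear correlation between simulated and observed flow
    alpha : ratio of standard deviations (sim / obs)
    beta  : ratio of means (sim / obs)
    Assumption: this variant is used for illustration only.
    """
    if len(sim) != len(obs) or len(obs) < 2:
        raise ValueError("sim and obs must have equal length >= 2")
    n = len(obs)
    mean_sim = sum(sim) / n
    mean_obs = sum(obs) / n
    var_sim = sum((s - mean_sim) ** 2 for s in sim) / n
    var_obs = sum((o - mean_obs) ** 2 for o in obs) / n
    cov = sum((s - mean_sim) * (o - mean_obs) for s, o in zip(sim, obs)) / n
    if var_sim == 0 or var_obs == 0 or mean_obs == 0:
        raise ValueError("KGE is undefined for constant series or zero mean")
    r = cov / math.sqrt(var_sim * var_obs)
    alpha = math.sqrt(var_sim / var_obs)
    beta = mean_sim / mean_obs
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def relative_signature_error(sim_value, obs_value):
    """Absolute relative error of one signature value (assumed definition)."""
    if obs_value == 0:
        raise ValueError("relative error is undefined for a zero observation")
    return abs(sim_value - obs_value) / abs(obs_value)


def is_adequate(sim_signatures, obs_signatures, threshold=0.5):
    """True if every signature is reproduced with error below the threshold.

    The 0.5 default mirrors the 50% error level mentioned in the text.
    """
    return all(
        relative_signature_error(s, o) < threshold
        for s, o in zip(sim_signatures, obs_signatures)
    )
```

A simulation that doubles every observed flow value has perfect correlation but doubled mean and variability, so its KGE is 1 − √2 ≈ −0.41. Conversely, a model can score well on the KGE and still fail `is_adequate` when individual signatures deviate strongly, which is the contrast between accuracy and adequacy that the text draws.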