Various methods are used in the literature to calibrate conceptual rainfall-runoff models. However, the question of how the number of model runs (or objective function calls) relates to the quality of the solutions found is rarely asked. In this study, two lumped conceptual rainfall-runoff models (HBV and GR4J with an added snow module) are calibrated for five catchments located in temperate climate zones of the USA and Poland, using three modern variants of Evolutionary Computation and Swarm Intelligence optimization algorithms with four different maximum numbers of function calls: 1000, 3000, 10,000 and 30,000. At the calibration stage, when more than 10,000 function calls are used, only marginal improvement in model performance is found, irrespective of the catchment or the calibration algorithm. For validation data, the relation between the number of function calls and model performance is even weaker; in some cases, the longer the calibration, the poorer the modelling performance. It is also shown that judging model performance by popular hydrological criteria, such as the Nash-Sutcliffe coefficient or the Persistence Index, may be misleading: very similar, highly positive values of the Nash-Sutcliffe coefficient obtained on different catchments may be accompanied by contradictory values of the Persistence Index.
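The divergence between the two criteria follows directly from their standard definitions: the Nash-Sutcliffe coefficient benchmarks the simulation against the mean of the observations, whereas the Persistence Index benchmarks it against the naive "last observed value" forecast. The sketch below (Python/NumPy, using purely synthetic hydrographs that are not taken from the study) is a minimal illustration, under the standard formulations of both criteria, of how a strongly autocorrelated flow series can yield a high Nash-Sutcliffe value together with a low or negative Persistence Index.

```python
import numpy as np

def nse(q_obs, q_sim):
    """Nash-Sutcliffe efficiency: skill relative to the mean of the observations."""
    q_obs, q_sim = np.asarray(q_obs, float), np.asarray(q_sim, float)
    return 1.0 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

def persistence_index(q_obs, q_sim):
    """Persistence Index: skill relative to the naive 'last observed value' forecast."""
    q_obs, q_sim = np.asarray(q_obs, float), np.asarray(q_sim, float)
    sse_model = np.sum((q_obs[1:] - q_sim[1:]) ** 2)
    sse_naive = np.sum((q_obs[1:] - q_obs[:-1]) ** 2)
    return 1.0 - sse_model / sse_naive

# Synthetic illustration (hypothetical data, not from the study):
# a slowly varying hydrograph makes the naive persistence benchmark hard to
# beat, so the same level of simulation error that gives a high NSE can still
# produce a strongly negative Persistence Index, while a flashier hydrograph
# gives high values of both.
rng = np.random.default_rng(0)
t = np.arange(365)
q_smooth = 10 + 5 * np.sin(2 * np.pi * t / 365)                               # slowly varying flow
q_flashy = 10 + 5 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 3, t.size)    # flashier flow
for label, q_obs in (("smooth", q_smooth), ("flashy", q_flashy)):
    q_sim = q_obs + rng.normal(0, 1.0, t.size)                                # same simulation error level
    print(f"{label}:  NSE = {nse(q_obs, q_sim):6.3f}   PI = {persistence_index(q_obs, q_sim):8.3f}")
```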