Catchment models are conventionally evaluated in terms of their response surface or likelihood surface, constructed from model runs using different sets of model parameters. Model evaluation methods are mainly based on the concept of the equifinality of model structures or parameter sets. The operational definition of equifinality is that multiple model structures/parameters are equally capable of producing acceptable simulations of catchment processes such as runoff. Examining various aspects of this convention, in this thesis I demonstrate its shortcomings and introduce improvements, including new approaches and insights for evaluating catchment models as multiple working hypotheses (MWH). First (Chapter 2), arguing that there is more to equifinality than just model structures/parameters, I propose a theoretical framework to conceptualise various facets of equifinality, based on a meta-synthesis of a broad range of literature across geosciences, system theory, and philosophy of science. I distinguish between process-equifinality (equifinality within the real-world systems/processes) and model-equifinality (equifinality within models of real-world systems), explain various aspects of each of these two facets, and discuss their implications for hypothesis testing and modelling of hydrological systems under uncertainty. Second (Chapter 3), building on this theoretical framework, I propose that characterising model-equifinality based on model internal fluxes (instead of model parameters, which is the current approach to account for model-equifinality) provides valuable insights for evaluating catchment models. I develop a new method for model evaluation, called flux mapping, based on the equifinality of runoff-generating fluxes of large ensembles of catchment model simulations (1 million model runs for each catchment).
Evaluating the model behaviour within the flux space is a powerful approach, beyond the convention, to formulate testable hypotheses for runoff generation processes at the catchment scale. Third (Chapter 4), I further explore the dependency of the flux map of a catchment model upon the choice of model structure and parameterisation, error metric, and data information content. I compare two catchment models (SIMHYD and SACRAMENTO) across 221 Australian catchments (known as Hydrologic Reference Stations, HRS) using multiple error metrics. In particular, I demonstrate the fundamental shortcomings of two widely used error metrics in model evaluation, namely Nash–Sutcliffe efficiency and Willmott’s refined index of agreement. I develop the skill score version of Kling–Gupta efficiency (KGEss), and argue that it is a more reliable error metric than the other metrics. I also compare two strategies for model parameterisation, random sampling (Latin Hypercube Sampling) and guided search (Shuffled Complex Evolution), and discuss their implications for evaluating catchment models as MWH. Finally (Chapter 5), I explore how catchment characteristics (physiographic, climatic, and streamflow response characteristics) control the flux map of catchment models (i.e. runoff generation hypotheses). To this end, I formulate runoff generation hypotheses from a large ensemble of SIMHYD simulations (1 million model runs in each catchment). These hypotheses are based on the internal runoff fluxes of SIMHYD (namely infiltration excess overland flow, interflow and saturation excess overland flow, and baseflow), which represent runoff generation at the catchment scale. I examine the dependency of these hypotheses on 22 different catchment attributes across 186 of the HRS catchments with acceptable model performance and sufficient parameter sampling.
The model performance of each simulation is evaluated using the KGEss metric benchmarked against the catchment-specific calendar-day average observed flow model, which is more informative than the conventional benchmark of the overall average observed flow. I identify catchment attributes that control the degree of equifinality of model runoff fluxes. A higher degree of flux equifinality implies larger uncertainties associated with the representation of runoff processes at the catchment scale, and hence poses a greater challenge for reliable and realistic simulation and prediction of streamflow. The findings of this chapter provide insights into the functional connectivity of catchment attributes and the internal dynamics of model runoff fluxes.
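
To make the idea of a benchmarked skill score concrete, the following is a minimal sketch rather than the thesis's implementation. It assumes the original Kling–Gupta efficiency with correlation, variability ratio, and bias ratio components, and the generic skill-score form (KGE_model − KGE_benchmark) / (1 − KGE_benchmark); the exact formulation used in the thesis is not stated in this abstract. The function names `kge`, `calendar_day_benchmark`, and `kge_skill_score` are hypothetical.

```python
import math
from collections import defaultdict


def _mean(x):
    return sum(x) / len(x)


def _std(x):
    m = _mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))


def kge(sim, obs):
    """Kling-Gupta efficiency (assumed form: r, sigma ratio, mean ratio)."""
    ms, mo = _mean(sim), _mean(obs)
    ss, so = _std(sim), _std(obs)
    cov = sum((s - ms) * (o - mo) for s, o in zip(sim, obs)) / len(obs)
    r = cov / (ss * so)   # linear correlation
    alpha = ss / so       # variability ratio
    beta = ms / mo        # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def calendar_day_benchmark(obs, day_of_year):
    """Benchmark series: mean observed flow for each calendar day."""
    groups = defaultdict(list)
    for q, d in zip(obs, day_of_year):
        groups[d].append(q)
    day_mean = {d: _mean(v) for d, v in groups.items()}
    return [day_mean[d] for d in day_of_year]


def kge_skill_score(sim, obs, bench):
    """Assumed skill-score form: 1 is perfect, 0 matches the benchmark."""
    kge_bench = kge(bench, obs)
    return (kge(sim, obs) - kge_bench) / (1.0 - kge_bench)
```

Under this assumed form, a simulation scores above zero only if it outperforms the calendar-day average of observed flow, which is a stricter test than outperforming the overall mean flow in catchments with a seasonal flow regime.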