Distribution forecasts P over future quantities or events are routinely made in hydrology but are usually traded for a (likelihood‐weighted) mean or median prediction to accommodate error measures or scoring functions such as the mean absolute error or mean squared error. A case in point is the so‐called Kling‐Gupta efficiency (KGE) of Gupta et al. (2009, https://doi.org/10.1016/j.jhydrol.2009.08.003) and improvements thereof (Lamontagne et al., 2020, https://doi.org/10.1029/2020wr027101), which have rapidly gained popularity among hydrologists as alternative scoring functions to the commonly used Nash and Sutcliffe (1970, https://doi.org/10.1016/0022‐1694(70)90255‐6) efficiency, but are equally restrictive in that they quantify model performance using only single‐valued output of the quantities of interest. This point‐valued mapping necessarily implies a loss of information about model performance. This paper advocates probabilistic watershed model training, evaluation and diagnostics. Distribution evaluation opens up a mature literature on scoring rules whose strong statistical underpinning provides, as we will demonstrate, the theory, context and guidelines necessary for the development of robust, information‐theoretically principled metrics for watershed signatures. These so‐called hydrograph functionals are scalar‐valued mappings of major behavioral watershed functions embodied in a strictly proper scoring rule. We discuss past developments that led to the current state of the art of distribution evaluation in hydrology and review scoring rules for dichotomous and categorical events, quantiles (intervals) and density forecasts. We are particularly concerned with elicitable functionals and scoring rule propriety, discuss the decomposition of scoring rules into sharpness, reliability and entropy terms, and present diagnostically appealing strictly proper divergence scores of hydrograph functionals for flood frequency analysis, flow duration and recession curves.
The usefulness and power of distribution‐based model evaluation and diagnostics by means of scoring rules are demonstrated on simple illustrative problems and on discharge distributions simulated with watershed models using random sampling and Bayesian model averaging. The presented theory (a) enables a more complete evaluation of distribution forecasts, (b) offers a statistically principled means for watershed model training, evaluation, diagnostics and selection using hydrograph functionals and/or extreme events and (c) provides a universal framework for metric development of watershed signatures, promoting metric standardization and reproducibility.