Recent observations with varied schedules and types (moving average, snapshot, or regularly spaced) can help to improve streamflow forecasts, but integrating them effectively is challenging. Based on a long short-term memory (LSTM) streamflow model, we tested multiple versions of a flexible procedure we call data integration (DI) to leverage recent discharge measurements to improve forecasts. DI accepts lagged inputs either directly or through a convolutional neural network unit. DI consistently elevated streamflow forecast performance to previously unseen levels, reaching a record continental-scale median Nash-Sutcliffe efficiency (NSE) of 0.86. Integrating moving-average discharge, discharge from the last few days, or even average discharge from the previous calendar month could all improve daily forecasts. Directly using lagged observations as inputs was comparable in performance to using the convolutional neural network unit. Importantly, we obtained valuable insights into the hydrologic processes that affect LSTM and DI performance. Before DI was applied, the base LSTM model worked well in mountainous or snow-dominated regions, but less well in regions with low discharge volumes (due to either low precipitation or high precipitation-energy synchronicity) and large interannual storage variability. DI was most beneficial in regions with high flow autocorrelation: it greatly reduced baseflow bias in groundwater-dominated western basins and also improved peak prediction for basins with dynamic surface-water storage, such as the Prairie Pothole or Great Lakes regions. However, even DI cannot elevate performance in high-aridity basins with 1-day flash peaks. Despite this limitation, a deep-learning-based forecast paradigm holds much promise given its performance, automation, efficiency, and flexibility.
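To make the DI idea concrete, the following PyTorch sketch appends a window of lagged discharge observations to an LSTM's meteorological forcing inputs, either directly or after a small 1-D convolutional unit, matching the two variants described above. All class and variable names here (e.g., `DIForecaster`, `n_lags`) are illustrative assumptions, not the authors' code; the `nse` helper only shows how the reported Nash-Sutcliffe efficiency is computed.

```python
import torch
import torch.nn as nn

class DIForecaster(nn.Module):
    """Sketch of data integration (DI): recent discharge observations
    enter the LSTM alongside meteorological forcings, either directly
    or after a 1-D convolutional unit compresses the lag window."""

    def __init__(self, n_forcings, n_lags, hidden=64, use_cnn=True):
        super().__init__()
        self.use_cnn = use_cnn
        if use_cnn:
            self.cnn = nn.Sequential(
                nn.Conv1d(1, 8, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),  # collapse lag window to 8 features
            )
            lag_features = 8
        else:
            lag_features = n_lags  # lagged observations fed in directly
        self.lstm = nn.LSTM(n_forcings + lag_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, forcings, lagged_q):
        # forcings: (batch, time, n_forcings); lagged_q: (batch, time, n_lags)
        if self.use_cnn:
            b, t, n = lagged_q.shape
            z = self.cnn(lagged_q.reshape(b * t, 1, n)).reshape(b, t, -1)
        else:
            z = lagged_q
        out, _ = self.lstm(torch.cat([forcings, z], dim=-1))
        return self.head(out)  # daily discharge forecast

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of error variance
    to the variance of the observations (1.0 is a perfect forecast)."""
    return 1.0 - ((sim - obs) ** 2).sum() / ((obs - obs.mean()) ** 2).sum()
```

In this sketch, `lagged_q` could hold discharge from the previous few days, a moving average, or the prior calendar month's mean, mirroring the varied integration schedules the abstract describes.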
The behavior and skill of models in many geosciences (e.g., hydrology and ecosystem sciences) strongly depend on spatially varying parameters that need calibration. A well-calibrated model can reasonably propagate information from observations to unobserved variables via model physics, but traditional calibration is highly inefficient and yields non-unique solutions. Here we propose a novel differentiable parameter learning (dPL) framework that efficiently learns a global mapping between inputs (and optionally responses) and parameters. Crucially, dPL exhibits beneficial scaling curves not previously demonstrated in the geosciences: as training data increase, dPL achieves better performance, more physical coherence, and better generalizability (across space and to uncalibrated variables), all at orders-of-magnitude lower computational cost. In examples learning from soil moisture and streamflow data, dPL drastically outperformed existing evolutionary and regionalization methods, or required only ~12.5% of the training data to achieve similar performance. This generic scheme promotes the integration of deep learning and process-based models without mandating reimplementation.
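A deliberately toy sketch of the dPL idea follows: a small network maps basin attributes to the parameter of a differentiable process model, and the whole chain is trained end-to-end against observed discharge, so one network serves all basins instead of calibrating each separately. The single-reservoir model and all names here are hypothetical stand-ins, not the process-based models the paper actually wraps.

```python
import torch
import torch.nn as nn

class ParamNet(nn.Module):
    """Global mapping g: basin attributes -> model parameters (the core
    of the dPL scheme, sketched with an arbitrary small architecture)."""
    def __init__(self, n_attrs, n_params):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_attrs, 32), nn.ReLU(),
            nn.Linear(32, n_params), nn.Sigmoid(),  # parameters scaled to (0, 1)
        )
    def forward(self, attrs):
        return self.net(attrs)

def toy_reservoir(precip, params):
    """Illustrative differentiable 'process model' (not the paper's):
    a single linear reservoir with outflow Q = k * S."""
    k = params[:, 0:1]            # outflow coefficient per basin
    S = torch.zeros_like(k)       # storage state
    flows = []
    for t in range(precip.shape[1]):
        S = S + precip[:, t:t+1]  # add forcing
        q = k * S                 # release a fraction of storage
        S = S - q
        flows.append(q)
    return torch.cat(flows, dim=1)

# End-to-end training: gradients flow through the process model back
# into the parameter network. All data below are synthetic placeholders.
g = ParamNet(n_attrs=10, n_params=1)
opt = torch.optim.Adam(g.parameters(), lr=1e-3)
attrs = torch.randn(8, 10)    # basin attributes
precip = torch.rand(8, 100)   # forcing time series
q_obs = torch.rand(8, 100)    # observed discharge
for _ in range(100):
    q_sim = toy_reservoir(precip, g(attrs))
    loss = ((q_sim - q_obs) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the loss is pooled over many basins, adding training basins improves the shared mapping rather than just one site's fit, which is the scaling behavior the abstract highlights.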
Dissolved oxygen (DO) reflects river metabolic pulses and is an essential water quality measure, yet our capability to forecast DO remains limited. Water quality data, and DO data in particular, often have large gaps and sparse spatial and temporal coverage. Earth surface and hydrometeorological data, on the other hand, have become widely available. Here we ask: can a long short-term memory (LSTM) model learn river DO dynamics from sparse DO data and intensive (daily) hydrometeorological data? We used CAMELS-chem, a new data set with DO concentrations from 236 minimally disturbed watersheds across the U.S. The model generally learns the theory of DO solubility and captures its decreasing trend with increasing water temperature. It shows potential for predicting DO in “chemically ungauged basins,” defined as basins without any measurements of DO, or of water quality more broadly. The model, however, misses some DO peaks and troughs where in-stream biogeochemical processes become important. Surprisingly, the model does not perform better where more data are available; instead, it performs better in basins with low variations in streamflow and DO, high runoff ratios (>0.45), and winter precipitation peaks. These results suggest that more data collection at DO peaks and troughs and in sparsely monitored areas is essential to overcoming data scarcity, an outstanding challenge in the water quality community.
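One practical ingredient implied by this setup is training on gappy targets: daily hydrometeorological inputs drive the LSTM at every time step, while the loss is evaluated only on days with DO observations. A minimal sketch of such a masked loss (an assumption about the general approach, not the authors' code):

```python
import torch

def masked_rmse(pred, obs):
    """RMSE over observed days only: sparse DO records are left as NaN
    and masked out of the loss rather than being imputed or interpolated."""
    mask = ~torch.isnan(obs)
    diff = pred[mask] - obs[mask]
    return torch.sqrt((diff ** 2).mean())

# Example: 365 daily predictions, but only a handful of DO samples.
pred = torch.rand(365)
obs = torch.full((365,), float("nan"))
obs[::30] = torch.rand(13)  # roughly monthly grab samples
loss = masked_rmse(pred, obs)
```

Masking lets the dense daily forcings shape the hidden states while only the sparse DO labels constrain the output, which is how an LSTM can learn from records with large gaps.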
The accuracy of these models has important implications for the government agencies and public stakeholders that place trust in them. Demand for accurate modeling capabilities will likely rise with the increasing risks of floods and droughts under climate change (IPCC, 2021). Traditionally, regional hydrologic models describe not only streamflow but also other water stores in the hydrologic cycle (snow, surface ponding, soil moisture, and groundwater), as well as fluxes (evapotranspiration, surface runoff, subsurface runoff, and baseflow), whereas newer, data-driven machine learning approaches tend to focus on predicting the variable on which they have been trained. The physical states (stores) and fluxes in traditional models help provide a full narrative of an event, for example, that high antecedent soil moisture or thawing snow primed the watershed for flooding; such narratives are important for communication with stakeholders.