Abstract. Deep learning (DL) rainfall–runoff models outperform conceptual, process-based models in a range of applications. However, it remains unclear whether DL models can produce physically plausible projections of streamflow under climate change. We investigate this question through a sensitivity analysis of modeled responses to increases in temperature and potential evapotranspiration (PET), with other meteorological variables left unchanged. Previous research has shown that temperature-based PET methods overestimate evaporative water loss under warming compared with energy-budget-based PET methods. We therefore assume that reliable streamflow responses to warming should exhibit less evaporative water loss when forced with the smaller, energy-budget-based PET than when forced with temperature-based PET. We conduct this assessment using three conceptual, process-based rainfall–runoff models and three DL models, trained and tested across 212 watersheds in the Great Lakes basin. The DL models include a Long Short-Term Memory network (LSTM), a mass-conserving LSTM (MC-LSTM), and a novel variant of the MC-LSTM that also respects the relationship between PET and evaporative water loss (MC-LSTM-PET). After validating models against historical streamflow and actual evapotranspiration, we force all models with scenarios of warming, historical precipitation, and both temperature-based (Hamon) and energy-budget-based (Priestley–Taylor) PET, and compare their responses in long-term mean daily flow, low flows, high flows, and seasonal streamflow timing. We also explore similar responses using a national LSTM fit to 531 watersheds across the United States to assess how the inclusion of a larger and more diverse set of basins influences signals of hydrological response under warming. The main results of this study are as follows: The three Great Lakes DL models substantially outperform all process-based models in streamflow estimation.
The MC-LSTM-PET also matches the best process-based models and outperforms the MC-LSTM in estimating actual evapotranspiration. All process-based models show a downward shift in long-term mean daily flows under warming, but median shifts are considerably larger under temperature-based PET (−17 % to −25 %) than under energy-budget-based PET (−6 % to −9 %). The MC-LSTM-PET model exhibits similar differences in water loss across the different PET forcings. Conversely, the LSTM exhibits unrealistically large water losses under warming using Priestley–Taylor PET (−20 %), while the MC-LSTM is relatively insensitive to the PET method. DL models exhibit smaller changes in high flows and seasonal timing of flows than the process-based models, while DL estimates of low flows are within the range estimated by the process-based models. Like the Great Lakes LSTM, the national LSTM also shows unrealistically large water losses under warming (−25 %), but it is more stable when many inputs are changed under warming and better aligns with process-based model responses for seasonal timing of flows. Ultimately, the results of this sensitivity analysis suggest that physical considerations regarding model architecture and input variables may be necessary to promote the physical realism of deep-learning-based hydrological projections under climate change.
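
The contrast between the two PET forcings can be illustrated with a minimal sketch. It uses commonly cited textbook forms of the Hamon and Priestley–Taylor equations, which may differ in detail from the formulations used in this study. The function names, the +4 °C warming, the 14 h day length, and the net radiation of 12 MJ m⁻² d⁻¹ are all illustrative assumptions and are not taken from the study. Net radiation is held fixed, mirroring the setup in which other meteorological variables are left unchanged.

```python
import math


def sat_vapor_pressure_kpa(t_c):
    """Saturation vapor pressure (kPa) at air temperature t_c (deg C), Tetens-type form."""
    return 0.6108 * math.exp(17.27 * t_c / (t_c + 237.3))


def hamon_pet(t_c, daylight_hours):
    """Hamon-type PET (mm/day): a function of temperature and day length only."""
    es_mb = 10.0 * sat_vapor_pressure_kpa(t_c)   # kPa -> mb
    rho_sat = 216.7 * es_mb / (t_c + 273.3)      # saturated vapor density (g/m^3)
    return 0.1651 * (daylight_hours / 12.0) * rho_sat


def priestley_taylor_pet(t_c, net_radiation, alpha=1.26):
    """Priestley-Taylor PET (mm/day).

    net_radiation is in MJ m^-2 day^-1; ground heat flux is neglected (assumption).
    """
    es = sat_vapor_pressure_kpa(t_c)
    delta = 4098.0 * es / (t_c + 237.3) ** 2     # slope of the vapor pressure curve (kPa/degC)
    gamma = 0.066                                # psychrometric constant (kPa/degC), approximate
    latent_heat = 2.45                           # latent heat of vaporization (MJ/kg)
    return alpha * delta / (delta + gamma) * net_radiation / latent_heat


def percent_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    base_t, warm_t = 20.0, 24.0      # hypothetical baseline and +4 degC scenario
    day_len, rn = 14.0, 12.0         # hypothetical day length (h) and net radiation

    hamon_change = percent_change(hamon_pet(warm_t, day_len), hamon_pet(base_t, day_len))
    pt_change = percent_change(
        priestley_taylor_pet(warm_t, rn), priestley_taylor_pet(base_t, rn)
    )
    print(f"Hamon PET change under warming:            {hamon_change:.1f} %")
    print(f"Priestley-Taylor PET change under warming: {pt_change:.1f} %")
```

Under these assumptions the Hamon estimate rises several times faster with temperature than the Priestley–Taylor estimate, because the former scales with saturation vapor pressure while the latter is bounded by the available energy. This is the asymmetry that the sensitivity analysis relies on when judging whether a model's streamflow response is physically plausible.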