Abstract. A key challenge for biological oceanography is relating the physiological
mechanisms controlling phytoplankton growth to the spatial distribution of
those phytoplankton. Physiological mechanisms are often isolated by varying
one driver of growth, such as a nutrient or light, in a controlled laboratory
setting, producing what we call “intrinsic relationships”. We contrast
these with the “apparent relationships” which emerge in the environment in
climatological data. Although previous studies have shown that machine learning
(ML) can identify apparent relationships, no systematic study has yet
examined when and why these apparent relationships diverge from the
underlying intrinsic relationships found in the lab, or how and why this
divergence may depend on the method applied. Here we conduct a proof-of-concept study
with three scenarios in which biomass is by construction a function of
time-averaged phytoplankton growth rate. In the first scenario, the inputs
and outputs of the intrinsic and apparent relationships vary over the
same monthly timescales. In the second, the intrinsic relationships relate
averages of drivers that vary on hourly timescales to biomass, but the
apparent relationships are sought between monthly averages of these inputs
and monthly-averaged output. In the third scenario we apply ML to the output
of an actual Earth system model (ESM). Our results demonstrated that when
intrinsic and apparent relationships operated on the same spatial and
temporal scales, neural network ensembles (NNEs) were able to extract the
intrinsic relationships when only provided information about the apparent
relationships, whereas random forests (RFs) diverged from the true response
because of colimitation and their inability to extrapolate. When
intrinsic and apparent relationships operated on different timescales (as
little separation as hourly versus daily), NNEs trained on the apparent
relationships in time-averaged data produced responses with the right shape
but underestimated the biomass. This was because when the intrinsic
relationship was nonlinear, the response to a time-averaged input differed
systematically from the time-averaged response. Although the NNEs
overestimated the limitations, they produced more realistic
shapes of the actual relationships than multiple linear regression did.
Additionally, NNEs were able to model the interactions between predictors
and their effects on biomass, allowing for a qualitative assessment of the
colimitation patterns and the most limiting nutrient. Future
research may be able to apply this type of analysis to observational datasets
and other ESMs to identify apparent relationships between biogeochemical
variables (rather than spatiotemporal distributions only) and to identify
interactions and colimitations while performing fewer, or possibly no,
growth experiments in a lab. From our study, it appears
that ML can extract useful information from ESM output and could likely do
so for observational datasets as well.
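
The timescale effect described above (the response to a time-averaged input differing from the time-averaged response when the intrinsic relationship is nonlinear) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the saturating light response, the half-saturation value of 50, the idealized half-sine diel light cycle, and the function names.

```python
import math

def light_limited_growth(irradiance, half_saturation=50.0):
    """Hypothetical saturating light response (assumed form), bounded in [0, 1)."""
    return irradiance / (half_saturation + irradiance)

def hourly_irradiance(peak=400.0, hours=24):
    """Idealized diel cycle (assumption): half-sine from hour 6 to 18, dark otherwise."""
    return [max(0.0, peak * math.sin(math.pi * (h - 6) / 12)) for h in range(hours)]

light = hourly_irradiance()
mean_light = sum(light) / len(light)

# Time-averaged response: apply the intrinsic curve each hour, then average.
mean_of_response = sum(light_limited_growth(i) for i in light) / len(light)

# Response to the time-averaged input: average the light first, then apply the curve.
response_of_mean = light_limited_growth(mean_light)

print(f"mean light       = {mean_light:.1f}")
print(f"mean of response = {mean_of_response:.3f}")
print(f"response of mean = {response_of_mean:.3f}")
```

Because the assumed curve is concave, the averaged hourly response is lower than the curve evaluated at the averaged light. Under these assumptions, a model trained on averaged inputs and averaged outputs would therefore recover a curve with the right shape that sits below the intrinsic one.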