Abstract. Atmospheric chemistry models are a central tool to study the impact
of chemical constituents on the environment, vegetation and human health.
These models are computationally intensive, and previous attempts to reduce the
numerical cost of chemistry solvers have not delivered transformative change.
We show here the potential of a machine learning (in this case random forest
regression) replacement for the gas-phase chemistry in atmospheric chemistry
transport models. Our training data consist of 1 month (July 2013) of
output of chemical conditions together with the model physical state,
produced from the GEOS-Chem chemistry model v10. From this data set we train
random forest regression models to predict the concentration of each
transported species after the integrator, based on the physical and chemical
conditions before the integrator. The choice of prediction type has a strong
impact on the skill of the regression model. We find best results from
predicting the change in concentration for long-lived species and the
absolute concentration for short-lived species. We also find improvements
from a simple implementation of chemical families
(NOx = NO + NO2).
We then implement the trained random forest predictors back into GEOS-Chem to
replace the numerical integrator. The machine-learning-driven GEOS-Chem model
compares well to the standard simulation. For ozone (O3), errors from using the
random forests (compared to the reference simulation) grow slowly and after
5 days the normalized mean bias (NMB), root mean square error (RMSE) and
R² are 4.2 %, 35 % and 0.9, respectively; after 30 days the corresponding
values are 13 %, 67 % and 0.75, respectively. The biases become largest
in remote areas such as the tropical Pacific where errors in the chemistry
can accumulate with little balancing influence from emissions or deposition.
Over polluted regions the model error is less than 10 %, and the model follows
the time series of the full model with high fidelity. Modelled
NOx shows similar features, with the most significant errors
occurring in remote locations far from recent emissions. For other species
such as inorganic bromine species and short-lived nitrogen species, errors
become large, with NMB, RMSE and R² reaching >2100 %, >400 % and
<0.1, respectively.
This proof-of-concept implementation takes 1.8 times more time than the
direct integration of the differential equations, but optimization and
software engineering should allow substantial increases in speed. We discuss
potential improvements in the implementation, some of its advantages from
both a software and hardware perspective, its limitations, and its
applicability to operational air quality activities.
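
The three skill scores quoted above (NMB, RMSE and R²) can be illustrated with a minimal sketch. The definitions used here are assumptions, not taken from the abstract: NMB is the summed difference divided by the summed reference, the RMSE is normalized by the reference mean (which would be consistent with it being reported in percent), and R² is the squared Pearson correlation. The function names and the sample values are hypothetical.

```python
import math


def nmb(model, ref):
    """Normalized mean bias: sum(model - ref) / sum(ref) (assumed definition)."""
    return sum(m - r for m, r in zip(model, ref)) / sum(ref)


def nrmse(model, ref):
    """RMSE divided by the reference mean (assumed normalization)."""
    n = len(ref)
    rmse = math.sqrt(sum((m - r) ** 2 for m, r in zip(model, ref)) / n)
    return rmse / (sum(ref) / n)


def r_squared(model, ref):
    """Squared Pearson correlation between model and reference (assumed definition)."""
    n = len(ref)
    mean_m = sum(model) / n
    mean_r = sum(ref) / n
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(model, ref))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov ** 2 / (var_m * var_r)


# Hypothetical concentrations: the emulator is uniformly 10 % too high.
reference = [10.0, 20.0, 30.0, 40.0]
emulated = [11.0, 22.0, 33.0, 44.0]

print(f"NMB  = {100 * nmb(emulated, reference):.1f} %")
print(f"RMSE = {100 * nrmse(emulated, reference):.1f} %")
print(f"R2   = {r_squared(emulated, reference):.2f}")
```

In this toy case the bias is 10 % while R² stays at 1, which shows why the three scores are reported together: a uniform offset is invisible to the correlation but not to the NMB or RMSE.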