The relative skill of 21 regional and global biogeochemical models was assessed in terms of how well the models reproduced observed net primary productivity (NPP) and environmental variables such as nitrate concentration (NO3), mixed layer depth (MLD), euphotic layer depth (Zeu), and sea ice concentration, by comparing results against a newly updated, quality‐controlled in situ NPP database for the Arctic Ocean (1959–2011). The models broadly captured the spatial features of integrated NPP (iNPP) on a pan‐Arctic scale. Most models underestimated iNPP to varying degrees, even though they overestimated surface NO3, MLD, and Zeu across the regions. Among the models, iNPP differed little between sea ice conditions (ice‐free versus ice‐influenced) and between bottom depths (shelf versus deep ocean). The models performed relatively well for the most recent decade and toward the end of Arctic summer. In the Barents and Greenland Seas, regional model skill for surface NO3 was best associated with how well MLD was reproduced. Regionally, iNPP was relatively well simulated in the Beaufort Sea and the central Arctic Basin, where in situ NPP is low and nutrients are mostly depleted. Models performed less well at simulating iNPP in the Greenland and Chukchi Seas, despite higher model skill in MLD and sea ice concentration, respectively. iNPP model skill was thus constrained by different factors in different Arctic Ocean regions. Our study suggests that better parameterization of biological and ecological microbial rates (phytoplankton growth and zooplankton grazing) is needed for improved Arctic Ocean biogeochemical modeling.