The ventral visual stream (VVS) is a hierarchically connected series of cortical areas known to underlie core object recognition behaviors, enabling humans and non-human primates to effortlessly recognize objects across a multitude of viewing conditions. While recent feedforward convolutional neural networks (CNNs) provide quantitatively accurate predictions of temporally-averaged neural responses throughout the ventral pathway, they lack two ubiquitous neuroanatomical features: local recurrence within cortical areas and long-range feedback from downstream areas to upstream areas. As a result, such models are unable to account for the temporally-varying dynamical patterns thought to arise from recurrent visual circuits, nor can they provide insight into the behavioral goals that these recurrent circuits might help support. In this work, we augment CNNs with local recurrence and long-range feedback, developing convolutional RNN (ConvRNN) network models that more correctly mimic the gross neuroanatomy of the ventral pathway. Moreover, when the form of the recurrent circuit is chosen properly, ConvRNNs with comparatively small numbers of layers can achieve high performance on a core recognition task, comparable to that of much deeper feedforward networks. We then compared these models to temporally fine-grained neural and behavioral recordings from primates to thousands of images. We found that ConvRNNs better matched these data than alternative models, including the deepest feedforward networks, on two metrics: 1) neural dynamics in V4 and inferotemporal (IT) cortex at late timepoints after stimulus onset, and 2) the varying times at which object identity can be decoded from IT, including more challenging images that take longer to decode. Moreover, these results differentiate within the class of ConvRNNs, suggesting that there are strong functional constraints on the recurrent connectivity needed to match these phenomena. Finally, we find that recurrent circuits that attain high task performance while having a smaller network size as measured by number of units, rather than another metric such as the number of parameters, are overall most consistent with these data. Taken together, our results evince the role of recurrence and feedback in the ventral pathway to reliably perform core object recognition while subject to a strong total network size constraint.