A key feature of animal and human decision-making is the balance between exploring unknown options for information gain (directed exploration) and exploiting known options for immediate reward, a trade-off often examined using restless bandit problems. Recurrent neural network models (RNNs) have recently gained traction in both human and systems neuroscience work on reinforcement learning. Here we comprehensively compared the performance of a range of RNN architectures and human learners on restless four-armed bandit problems. The best-performing architecture (an LSTM network with computation noise) achieved human-level performance. Cognitive modeling showed that both human and RNN behavior was best described by a learning model with terms accounting for perseveration and directed exploration. However, whereas human learners exhibited a positive effect of uncertainty on choice probability (directed exploration), RNNs showed the reverse effect (uncertainty aversion), in conjunction with increased perseveration. RNN hidden-unit dynamics revealed that exploratory choices were associated with a disruption of choice-predictive signals during states of low state value, resembling a win-stay-lose-shift strategy and resonating with previous single-unit recording findings in monkey prefrontal cortex. During exploration trials, RNNs selected exploration targets predominantly based on their recent value, but tended to avoid more uncertain options. Our results highlight both similarities and differences between exploration behavior as it emerges in RNNs and the computational mechanisms identified in cognitive and systems neuroscience work.
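For illustration only: the abstract does not state the model equations, but learning models with directed-exploration and perseveration terms of the kind described here typically combine learned option values with an uncertainty bonus and a choice-repetition bonus in a softmax choice rule. In the sketch below, $Q_t(a)$ denotes the learned value of option $a$, $\sigma_t(a)$ its estimated uncertainty, $\varphi$ an exploration-bonus weight, $\rho$ a perseveration weight, $\beta$ an inverse temperature, and $I(\cdot)$ an indicator of the previous choice; all symbols are illustrative assumptions rather than the paper's exact specification:

$$
P(c_t = a) \;=\; \frac{\exp\!\big(\beta\,[\,Q_t(a) + \varphi\,\sigma_t(a) + \rho\,I(c_{t-1} = a)\,]\big)}{\sum_{a'} \exp\!\big(\beta\,[\,Q_t(a') + \varphi\,\sigma_t(a') + \rho\,I(c_{t-1} = a')\,]\big)}
$$

Under this sketch, $\varphi > 0$ corresponds to the directed exploration seen in human learners, whereas $\varphi < 0$ captures the uncertainty aversion observed in the RNNs, and larger $\rho$ captures increased perseveration.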