Single-word naming is one of the most widely used experimental paradigms for studying how we read words. Following the seminal study by Spieler and Balota (Psychological Science 8:411-416, 1997), accounting for variance in item-level naming databases has become a major challenge for computational models of word reading. Using a new large-scale database of naming responses, we first provided a precise estimate of the amount of reproducible variance that models should try to account for with such databases. Second, by using an item-level measure of delayed naming, we showed that it captures not only the variance usually explained by onset phonetic properties, but also an additional part of the variance related to output processes. Finally, by comparing the item means from this new database with those reported in a previous study, we found that the two sets of item response times were highly reliable (r = .94) when the variance related to onset phonetic properties and voice-key sensitivity was factored out. Overall, the present results provide new guidelines for testing computational models of word naming with item-level databases.

Keywords: Single-word naming · Item-level performance · Large-scale databases · Accounting for variance · Testing computational models

The precision of theories and the grain size of empirical evidence do not always follow the same developmental trajectories in science, and this statement certainly also holds for psycholinguistic research. Indeed, three decades ago, with the development of the first computer-implemented models (e.g., McClelland & Rumelhart, 1981), the precision of theoretical predictions grew dramatically, given the possibility for these artificial, dynamic systems to generate item-level predictions.
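The "factored out" reliability estimate above amounts to correlating two sets of item means after regressing out nuisance covariates from each. A minimal sketch of that residualized correlation, using simulated data (the covariate names, weights, and sample size are illustrative assumptions, not the authors' actual analysis):

```python
import numpy as np

def residualize(y, covariates):
    """Residuals of y after OLS regression on the covariates (with intercept)."""
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(0)
n_items = 200
# Hypothetical item-level nuisance covariates (e.g., onset phonetic codes,
# voice-key sensitivity); each database is affected by a different one.
covariates = rng.normal(size=(n_items, 2))
shared = rng.normal(size=n_items)  # reproducible item-level variance
rt_a = shared + covariates[:, 0] + rng.normal(scale=0.3, size=n_items)
rt_b = shared + covariates[:, 1] + rng.normal(scale=0.3, size=n_items)

raw_r = np.corrcoef(rt_a, rt_b)[0, 1]
res_r = np.corrcoef(residualize(rt_a, covariates),
                    residualize(rt_b, covariates))[0, 1]
```

Because the nuisance variance is uncorrelated across the two simulated databases, removing it raises the between-database correlation, which is the logic behind the r = .94 estimate reported above.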
At the same time, the grain size of empirical evidence was still bounded by a set of benchmark effects (e.g., the word frequency effect and consistency or regularity effects) modulated, in the best cases, by interaction effects (e.g., the Regularity × Frequency interaction; Taraban & McClelland, 1987).

To compensate for the disequilibrium between the grain size of the empirical data and the grain size of theoretical predictions, Spieler and Balota (1997) collected a large-scale database (31 participants reading aloud a set of 2,870 monosyllabic English words) and tested the descriptive adequacy of various computational models of word reading against these item-level measures of performance, initiating what we may call a "nanopsycholinguistic" approach. The outcome was quite disappointing for modelers, since the best model accounted for only around 10% of the item variance. More surprisingly, a linear combination of three simple factors (Log Frequency, Neighborhood Density, and Word Length) accounted for 27.1% of the variance, and when the onset phoneme characteristics were taken into account, this percentage even reached 43.1%, much higher than the predictions of any of the models tested in that study. Instead of re-equilibrating the precision rati...
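The "percentage of variance accounted for" by such a linear combination is simply the R² of a multiple regression of item naming latencies on the predictors. A minimal sketch on simulated data (the predictor weights, noise level, and item count are illustrative assumptions, not values from the Spieler and Balota database):

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 500
# Hypothetical standardized item predictors
log_freq = rng.normal(size=n_items)       # log word frequency
n_density = rng.normal(size=n_items)      # neighborhood density
length = rng.normal(size=n_items)         # word length
# Simulated item naming latencies: predictor effects plus unexplained variance
rt = 600 - 20 * log_freq - 8 * n_density + 12 * length \
     + rng.normal(scale=30, size=n_items)

# OLS fit and R^2 (proportion of item variance accounted for)
X = np.column_stack([np.ones(n_items), log_freq, n_density, length])
beta, *_ = np.linalg.lstsq(X, rt, rcond=None)
pred = X @ beta
r_squared = 1 - np.sum((rt - pred) ** 2) / np.sum((rt - rt.mean()) ** 2)
```

A model's item-level predictions can be scored the same way, by correlating predicted and observed latencies over items, which is how the roughly 10% figure for the best model was obtained.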