Predictive toxicity models rely on large amounts of accurate in vivo data. Here, we analyze the quality of in vivo data from the U.S. EPA Toxicity Reference Database (ToxRefDB), using chemical-induced anemia as an example. Considerations include variation in experimental conditions, changes in terminology over time, distinguishing negative from missing results, observer and diagnostic bias, and data transcription errors. Within ToxRefDB, we use hematological data on 658 chemicals tested in one or more of 1738 studies (subchronic rat or chronic rat, mouse, or dog). Anemia was reported most frequently in rat subchronic studies, followed by chronic studies in dog, rat, and then mouse. Concordance between studies for a positive finding of anemia (same chemical, different laboratories) ranged from 90% (rat subchronic predicting rat chronic) to 40% (mouse chronic predicting rat chronic). Manual curation increased concordance by 20% on average. We identified 49 chemicals that showed an anemia phenotype in at least two species. These included 14 compounds containing an aniline moiety, which were further analyzed for their potential to be metabolically transformed into substituted anilines, which are known to cause anemia. This analysis should help inform future use of in vivo databases for model development.
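
To make the directional concordance figures concrete, the following minimal Python sketch shows one way such a measure could be computed. It assumes, as one reading of "A predicting B", that concordance is the fraction of chemicals with a positive anemia finding in study type A that are also positive in study type B, counted only over chemicals with an explicit call in both; this definition is an assumption for illustration and is not taken from the text above. The record layout, the function name `positive_concordance`, the study-type labels, and the chemicals are all hypothetical and are not ToxRefDB data.

```python
from typing import Dict, Optional

# Hypothetical layout: chemical -> study type -> anemia call.
# True  = anemia reported
# False = hematology examined, no anemia reported (a true negative)
# None or absent key = no usable result (missing, not negative)
Calls = Dict[str, Dict[str, Optional[bool]]]


def positive_concordance(calls: Calls, predictor: str, target: str) -> Optional[float]:
    """Fraction of chemicals positive in `predictor` that are also positive in `target`.

    Only chemicals with an explicit True/False call in both study types are
    counted, so missing results are never treated as negatives.
    Returns None when no chemical is positive in the predictor study type.
    """
    tested_in_both = [
        c for c in calls.values()
        if c.get(predictor) is not None and c.get(target) is not None
    ]
    positives = [c for c in tested_in_both if c[predictor]]
    if not positives:
        return None
    hits = sum(1 for c in positives if c[target])
    return hits / len(positives)


# Illustrative, made-up calls (not real data).
example: Calls = {
    "chem_A": {"rat_subchronic": True, "rat_chronic": True},
    "chem_B": {"rat_subchronic": True, "rat_chronic": False},
    "chem_C": {"rat_subchronic": True, "rat_chronic": None},  # missing: excluded
    "chem_D": {"rat_subchronic": True, "rat_chronic": True},
    "chem_E": {"rat_subchronic": False, "rat_chronic": False},
}

forward = positive_concordance(example, "rat_subchronic", "rat_chronic")
reverse = positive_concordance(example, "rat_chronic", "rat_subchronic")
```

Under this assumed definition the measure is directional: in the made-up example, `forward` and `reverse` differ because chem_B is positive only in the subchronic study. Keeping `None` distinct from `False` reflects the point about distinguishing negative from missing results.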