Observations are widely used in research and evaluation to characterize teaching and learning activities. Because conducting observations is typically resource intensive, it is important that inferences drawn from observation data can be made with confidence. While interrater reliability receives considerable attention, less attention is paid to how reliably a measure taken from a single class period represents teaching over the course of a semester. We examined the use and limitations of observation for evaluating teaching practices, and asked how many observations are needed during a typical course to make confident inferences about those practices. We conducted two studies based on generalizability theory to calculate reliabilities given class-to-class variation in teaching over a semester. Eleven observations of class periods spread over a semester were needed to achieve a reliable measure, many more than the one to four class periods typically observed in the literature. These findings suggest that practitioners may need to devote more resources than anticipated to achieve reliable measures and comparisons.
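
The link between class-to-class variation and the number of observations needed can be illustrated with a minimal sketch. It assumes a simple one-facet design in which the reliability of an average over n observed class periods is the instructor variance divided by the sum of the instructor variance and the residual variance over n. The function names, the 0.8 threshold, and the variance components below are hypothetical and chosen only for illustration; they are not the design or the estimates from the two studies.

```python
def reliability(var_instructor, var_residual, n_obs):
    """Generalizability coefficient for the mean of n_obs observed class periods.

    Assumes a one-facet design: var_instructor is the variance between
    instructors, var_residual is the class-to-class (plus error) variance.
    """
    if n_obs < 1:
        raise ValueError("n_obs must be at least 1")
    return var_instructor / (var_instructor + var_residual / n_obs)


def min_observations(var_instructor, var_residual, threshold=0.8, max_obs=200):
    """Smallest number of observations whose averaged score reaches the threshold.

    Returns None if the threshold is not reached within max_obs observations.
    """
    for n_obs in range(1, max_obs + 1):
        if reliability(var_instructor, var_residual, n_obs) >= threshold:
            return n_obs
    return None


if __name__ == "__main__":
    # Hypothetical variance components, not estimates from the studies.
    var_instructor, var_residual = 0.3, 1.0
    for n_obs in (1, 4, 14):
        value = reliability(var_instructor, var_residual, n_obs)
        print(f"n = {n_obs:2d}  reliability = {value:.2f}")
    print("observations needed:", min_observations(var_instructor, var_residual))
```

With these made-up values, a single observation yields a reliability of about 0.23 and 14 observations are needed to reach 0.8. The larger the class-to-class variance is relative to the variance between instructors, the more observations are required.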