Since the passage of the No Child Left Behind legislation, the assessment of teacher effectiveness (TE) for accountability purposes has been at the forefront of educational policy. Prominent among both existing and newly developed measures is the Classroom Assessment Scoring System (CLASS; Pianta, La Paro, & Hamre, 2008). The CLASS is currently used in more than 40 states (Teachstone, 2013; Office of Head Start, 2014) to make high-stakes decisions about teachers, including compensation, promotion, and termination. For this reason, it is important that measures like the CLASS be evaluated through research. We hypothesized that if measures like the CLASS can be used reliably for high-stakes decisions, then scores for individual teachers should remain stable over time, particularly within units of thematically related lessons. We used a single-subject design, reflective of the real-world uses of TE scores, to assess score stability for two kindergarten teachers purposively selected from a larger database. Stability ranges were created around each teacher's mean scores and then visually examined. Substantial variability was found between lessons for both teachers, particularly in the instructional support domain of the CLASS. We conclude that single observations are likely not sufficient to reliably evaluate teachers' instructional effectiveness. Further research should investigate: (1) whether similar variability is found with a larger number of teachers observed over longer periods of time; (2) whether this instability is found when using other TE measures; (3) which factors contribute to the observed instability; and (4) how many observations are needed to obtain an accurate view of a teacher's effectiveness patterns.