We evaluated the score stability of the Framework for Teaching (FFT), a prominent observation instrument used for teacher evaluation. Three raters each scored 200 reading and mathematics lessons taught by 20 kindergarten teachers. Using Generalizability theory analyses, we decomposed the FFT's Classroom Environment, Instruction, and Total scores into potential sources of variation (teachers, lessons, raters, and their interactions). The proportions of score variance attributable to differences among teachers were 71% and 76% for Classroom Environment, 49% and 37% for Instruction, and 69% and 66% for the Total score, for reading and mathematics, respectively. Reliability estimates (G) ranged from 0.92 to 0.96 for Classroom Environment and Total scores; they were 0.87 and 0.79 for reading and mathematics Instruction, respectively. Decision studies indicated that two raters, each scoring three reading lessons or four mathematics lessons, are necessary to achieve sufficiently reliable Total scores. For Instruction scores, three raters each scoring seven reading lessons are needed; for mathematics, more than four raters each scoring eight lessons are needed.
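The decision-study logic summarized above can be illustrated with a minimal sketch. It assumes a design in which lessons are nested within teachers and crossed with raters, and it uses the standard relative-error form of the generalizability coefficient. The variance components in `HYPOTHETICAL`, the function names, and the reliability threshold are illustrative assumptions; they are not the study's estimates or its stated criterion.

```python
from itertools import product


def g_coefficient(var_t, var_l, var_tr, var_res, n_lessons, n_raters):
    """Relative G coefficient for an assumed (lessons:teachers) x raters design.

    var_t   : teacher variance (the "true score" component)
    var_l   : lesson-within-teacher variance
    var_tr  : teacher-by-rater interaction variance
    var_res : residual variance (lesson-by-rater within teacher, plus error)
    """
    rel_error = (
        var_l / n_lessons
        + var_tr / n_raters
        + var_res / (n_lessons * n_raters)
    )
    return var_t / (var_t + rel_error)


def smallest_design(components, threshold, max_raters=4, max_lessons=8):
    """Return (n_raters, n_lessons) with the fewest total scorings whose
    G coefficient meets the threshold, or None if no design in range does."""
    feasible = [
        (n_r * n_l, n_r, n_l)
        for n_r, n_l in product(range(1, max_raters + 1), range(1, max_lessons + 1))
        if g_coefficient(*components, n_lessons=n_l, n_raters=n_r) >= threshold
    ]
    if not feasible:
        return None
    _, n_r, n_l = min(feasible)  # ties are broken in favor of fewer raters
    return n_r, n_l


# Hypothetical variance components (teacher, lesson, teacher x rater, residual);
# NOT taken from the study.
HYPOTHETICAL = (0.60, 0.20, 0.10, 0.10)

if __name__ == "__main__":
    print(g_coefficient(*HYPOTHETICAL, n_lessons=3, n_raters=2))
    print(smallest_design(HYPOTHETICAL, threshold=0.81))
```

Averaging over more lessons shrinks the lesson and residual terms, while adding raters shrinks the teacher-by-rater and residual terms, which is why a score with a smaller share of teacher variance (such as Instruction) needs more raters and lessons to reach the same reliability.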