Cynthia Hudley

Direct observation in natural settings has been recognized for nearly a century (Thomas, 1929) as an important methodology to capture the nuances of children's behavior for research, clinical assessment, and program evaluation. These ecologically valid descriptive data complement more widely used behavior rating scales to evaluate the impact of intervention programming on children's aggressive behavior. While observation data make an undeniable contribution to evaluation findings, particularly for programs of behavior change, the validity and reliability of observation data have been debated for more than half a century (Arrington, 1943; Johnson and Bolstad, 1973). This paper discusses challenges encountered in a multisite, multimethod evaluation of an intervention program to reduce aggressive behavior in elementary schools. The evaluation design incorporated multiple sources of data, including playground observations. The observation data were difficult to collect and challenging to integrate into the full complement of evaluation findings. After briefly reviewing the literature on the costs and benefits of observation data that guided the decision to include an observation component in the evaluation design, the chapter will describe the evaluation project, a series of apparently unique, previously unreported difficulties in implementing the observation component, and the lessons learned from those difficulties.