Missing data, such as item responses in multilevel data, are ubiquitous in educational research settings. Researchers in the item response theory (IRT) context have shown that ignoring such missing data can create problems in the estimation of IRT model parameters. Consequently, several imputation methods for dealing with missing item data have been proposed and shown to be effective when applied with traditional IRT models. Additionally, a nonimputation direct likelihood analysis has been shown to be an effective tool for handling missing observations in clustered data settings. This study investigates the performance of six simple imputation methods, which have been found to be useful in other IRT contexts, versus a direct likelihood analysis, in multilevel data from educational settings. Multilevel item response data were simulated on the basis of two empirical data sets, and some of the item scores were deleted such that they were either missing completely at random or missing at random. An explanatory IRT model was used for modeling the complete, incomplete, and imputed data sets. We showed that direct likelihood analysis of the incomplete data sets produced unbiased parameter estimates comparable to those from a complete-data analysis. Multiple-imputation approaches based on the two-way mean and corrected item mean substitution methods displayed varying degrees of effectiveness in imputing data that could in turn produce unbiased parameter estimates. The simple random imputation, adjusted random imputation, item mean substitution, and regression imputation methods seemed to be less effective in imputing missing item scores in multilevel data settings.

Keywords: Item response theory . Multilevel data . Missing data . Imputation methods

Multilevel data in education settings can contain complex patterns of nested sources of variability.
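To make the two-way mean substitution method mentioned above concrete, the following is a minimal illustrative sketch, not the study's exact procedure: a missing binary score for person p on item i is replaced by the person mean plus the item mean minus the grand mean, rounded to 0 or 1 (multiple-imputation variants of this method typically add a random residual before rounding).

```python
import numpy as np

def two_way_imputation(scores):
    """Impute missing binary item scores (coded as NaN) with the
    two-way mean: person mean + item mean - grand mean, computed
    over observed scores only, then rounded and clipped to {0, 1}."""
    scores = np.asarray(scores, dtype=float)
    person_means = np.nanmean(scores, axis=1, keepdims=True)  # per-row means
    item_means = np.nanmean(scores, axis=0, keepdims=True)    # per-column means
    grand_mean = np.nanmean(scores)                           # overall mean
    # Broadcast to a full persons-by-items matrix of candidate values
    candidate = person_means + item_means - grand_mean
    out = scores.copy()
    mask = np.isnan(scores)
    out[mask] = np.clip(np.round(candidate), 0, 1)[mask]
    return out
```

For example, in a small matrix where person 1 answered two of three items and both person and item means are high, the missing score is imputed as 1.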
For instance, suppose that exercise items nested in courses or chapters with varying difficulty levels are presented to students nested in schools or classes with varying ability levels. For each student, scores on the items, along with person and item properties, can be recorded. The collected data are clustered or multilevel in nature, consisting of the properties of schools, students, chapters, and items and the students' item scores (e.g., binary, pass/fail) on attempted items. These data can be modeled statistically, for instance with item response theory (IRT; van der Linden & Hambleton, 1997), to explain and understand student characteristics in relation to the item properties.

De Boeck and Wilson (2004) described IRT models within the framework of generalized linear mixed models (GLMMs) or nonlinear mixed models (NLMMs), also accounting for more complex multilevel structures than the structure of measurement occasions within subjects,