This study examines the effect of items exhibiting differential item functioning (DIF) on test equating through multilevel item response models (MIRMs) and traditional IRMs. The performance of three equating approaches was investigated under 24 simulation conditions defined by sample size, test length, DIF magnitude, and test type. The MIRMs, in which DIF effects were added as model parameters, were compared with the Stocking-Lord (SL) method, an IRM-based separate-calibration linking method, and with concurrent calibration. The results showed that the methods performed differently across the conditions examined. More specifically, the MIRMs were able to identify the DIF items, carry out the equating, and eliminate the bias caused by DIF within a single analysis. However, this does not mean that MIRMs are always the best approach, since increasing the sample size and test length generally improved IRM-based equating, whereas the MIRMs were less sensitive to these two conditions. Among the IRM-based methods, separate calibration was more affected by the presence of DIF items than concurrent calibration, and this effect was most pronounced when the DIF items were among the common (anchor) items and the magnitude of DIF was at level C (i.e., large DIF).
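To make the separate-calibration linking concrete, the sketch below illustrates the general idea behind the Stocking-Lord method referred to above: the linking constants A and B are chosen to minimize the squared distance between the test characteristic curves of the common items on the two forms. This is a minimal Python sketch under a 2PL model with hypothetical, simulated parameter values, not the software or design used in the study.

```python
import numpy as np
from scipy.optimize import minimize

def p_2pl(theta, a, b):
    # 2PL item characteristic curves; returns shape (n_theta, n_items)
    return 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))

def sl_loss(params, theta, a_ref, b_ref, a_new, b_new):
    # Stocking-Lord criterion: squared distance between the reference-form
    # test characteristic curve and the rescaled new-form curve.
    A, B = params
    a_t = a_new / A          # rescale discriminations to the reference metric
    b_t = A * b_new + B      # rescale difficulties to the reference metric
    tcc_ref = p_2pl(theta, a_ref, b_ref).sum(axis=1)
    tcc_new = p_2pl(theta, a_t, b_t).sum(axis=1)
    return np.mean((tcc_ref - tcc_new) ** 2)

# Hypothetical common-item parameters: the "new" form sits on a scale
# stretched by A_true and shifted by B_true relative to the reference form.
rng = np.random.default_rng(0)
a_ref = rng.uniform(0.8, 2.0, size=10)
b_ref = rng.normal(0.0, 1.0, size=10)
A_true, B_true = 1.2, 0.4
a_new = a_ref * A_true
b_new = (b_ref - B_true) / A_true

theta = np.linspace(-4.0, 4.0, 81)   # quadrature points over the ability scale
result = minimize(sl_loss, x0=[1.0, 0.0],
                  args=(theta, a_ref, b_ref, a_new, b_new))
print("Recovered linking constants A, B:", result.x)  # approx. (1.2, 0.4)
```

The sketch also hints at the mechanism behind the finding reported above: the SL criterion assumes the common items function identically across groups, so when anchor items carry DIF, their distorted parameter estimates pull the recovered A and B away from their true values, propagating bias into the equated scores.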