Knowledge tracing models, which estimate students' ability or knowledge based on data collected from their work on learning-related tasks, are widely studied within the educational data mining domain. In this work, we review and evaluate a body of deep learning knowledge tracing (DLKT) models using openly available and widely used datasets, as well as a novel dataset of students learning to program. We re-implement the evaluated DLKT models to assess the reproducibility of previously reported results and the level of detail with which the models and their evaluations have been described in previously published articles. The DLKT models are tested with different input and output layer variations found in the compared models, which are independent of the models' main architectures, as well as with different maximum attempt count options. We use several metrics to compare and contrast the results and to reflect on the quality and appropriateness of the evaluated knowledge tracing models. The evaluated knowledge tracing models include Vanilla-DKT, two variants of Long Short-Term Memory Deep Knowledge Tracing (LSTM-DKT), two variants of Dynamic Key-Value Memory Network (DKVMN), and Self-Attentive Knowledge Tracing (SAKT). As baselines, we evaluate logistic regression, Bayesian Knowledge Tracing (BKT), and simple non-learning models. Our empirical evaluation suggests that while the DLKT models with tuned hyperparameters generally outperform non-deep-learning models, the relative differences between the DLKT models are subtle and often vary between datasets. In particular, we observe that no single model consistently outperforms all other models across all datasets. Our results also show that on some datasets, simple non-learning models such as mean prediction can outperform more sophisticated knowledge tracing models, especially in terms of accuracy. Further, our metric and hyperparameter analysis shows that the metric used to select the best model hyperparameters has a noticeable effect on model performance, and that some metrics appear more favorable than others for certain models. We also study the effect of input and output layer variations on model performance, and analyze the impact of filtering out long attempt sequences, a practice that has been used both implicitly and explicitly in some studies. We further discuss the effect of non-model properties, such as randomness and hardware, on model performance. Finally, we discuss the replicability of model performance and related issues, including pitfalls, and suggest practices for future work. Our model implementations, evaluation code, and data are published as part of this work.