Objective: The accurate prediction of seizure freedom after epilepsy surgery remains challenging. We investigated whether 1) training more complex models, 2) recruiting larger sample sizes, or 3) using data-driven selection of clinical predictors would improve our ability to predict post-operative seizure outcome. We also conducted the first external validation of a machine learning model trained to predict post-operative seizure outcome.
Methods: We performed a retrospective cohort study of 797 children who had undergone resective or disconnective epilepsy surgery at a single tertiary center. We extracted patient information from medical records and trained three models (a logistic regression, a multilayer perceptron, and an XGBoost model) to predict one-year post-operative seizure outcome on our dataset. We evaluated the performance of a recently published XGBoost model on the same patients. We further investigated the impact of sample size on model performance, using learning curve analysis to estimate performance at sample sizes up to N=2,000. Finally, we examined the impact of predictor selection on model performance.
Results: Our logistic regression achieved an accuracy of 72% (95% CI=68-75%, AUC=0.72), while our multilayer perceptron and XGBoost both achieved accuracies of 71% (95% CI-MLP=67-74%, AUC-MLP=0.70; 95% CI-XGBoost own=68-75%, AUC-XGBoost own=0.70). There was no significant difference in performance between our three models (all P>0.4), and all three performed better than the external XGBoost, which achieved an accuracy of 63% (95% CI=59-67%, AUC=0.62; P-LR=0.005, P-MLP=0.01, P-XGBoost own=0.01) on our data. All models showed improved performance with increasing sample size, with limited improvements above N=400. The best model performance was achieved with data-driven feature selection.
Significance: We show that neither the deployment of complex machine learning models nor the assembly of thousands of patients alone is likely to generate significant improvements in our ability to predict post-operative seizure freedom. We instead propose that improved feature selection alongside collaboration, data standardization, and model sharing is required to advance the field.
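As an illustration of the learning curve analysis mentioned in the Methods, the following minimal sketch trains a logistic regression on progressively larger training subsets and scores each fit on a fixed held-out set. It is not the study's pipeline: the data are simulated, and the feature weights, training sizes, optimizer settings, and all function names (`make_synthetic`, `fit_logistic`, `learning_curve`) are assumptions chosen for illustration. It also does not attempt the extrapolation to N=2,000, since the abstract does not describe how that was done.

```python
import math
import random


def make_synthetic(n, n_features=5, seed=0):
    """Simulate a binary outcome from random features (NOT patient data)."""
    rng = random.Random(seed)
    # Arbitrary illustrative weights; the last two features carry no signal.
    true_w = [1.0, -0.8, 0.5, 0.0, 0.0][:n_features]
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(n_features)]
        p = 1 / (1 + math.exp(-sum(w * v for w, v in zip(true_w, x))))
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    return X, y


def fit_logistic(X, y, epochs=100, lr=0.5):
    """Fit logistic regression by plain batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = 1 / (1 + math.exp(-z)) - t  # predicted probability minus label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(model, X, y):
    """Fraction of samples classified correctly at a 0.5 probability threshold."""
    w, b = model
    correct = sum(
        (b + sum(wi * xi for wi, xi in zip(w, x)) > 0) == (t == 1)
        for x, t in zip(X, y)
    )
    return correct / len(y)


def learning_curve(train_sizes, n_test=500, seed=0):
    """Return (training size, held-out accuracy) pairs for nested training subsets."""
    X, y = make_synthetic(max(train_sizes) + n_test, seed=seed)
    X_test, y_test = X[:n_test], y[:n_test]
    X_pool, y_pool = X[n_test:], y[n_test:]
    curve = []
    for n in train_sizes:
        model = fit_logistic(X_pool[:n], y_pool[:n])
        curve.append((n, accuracy(model, X_test, y_test)))
    return curve


if __name__ == "__main__":
    for n, acc in learning_curve([25, 50, 100, 200, 400, 800]):
        print(f"N={n:4d}  held-out accuracy={acc:.3f}")
```

Plotting held-out accuracy against training size shows where additional samples stop yielding meaningful gains. In practice the curve would typically be averaged over repeated resampling or cross-validation folds rather than taken from a single split as here.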