Previous research has predicted subjective pain intensity from electroencephalographic (EEG) data using machine learning (ML) models. However, there is a paucity of externally validated ML models for pain assessment, particularly for continuous pain prediction (e.g., decoding pain ratings on a 101-point scale). We aimed to conduct the first external validation of ML regression models for pain intensity prediction from EEG data. Ninety-one subjects were recruited across three samples. Sample one (n = 40) was used for model development, sample two (n = 51) was used as a cross-subject external validation set, whilst sample three (n = 25) was used as a within-subject temporal external validation set. Pneumatic pressure stimuli were delivered to the left-hand index fingernail bed at 10 graded intensity levels. Single-trial time-frequency features of peri-stimulus EEG were used to train a Random Forest (RF) model and a long short-term memory (LSTM) network to predict pain intensity ratings. Both the RF model and the LSTM network predicted pain intensity significantly more accurately than a random prediction model, with the RF (the best-performing model) achieving mean absolute errors (MAE) of 19.59, 21.29, and 18.90 for internal validation, cross-subject external validation, and within-subject external validation, respectively. However, neither model predicted pain intensity better than a baseline dummy model, which predicted the mean behavioural rating of the training set and did not have access to neural data. Moreover, in a replication of our recent work, we developed an RF model for the classification of low- and high-pain trials, which demonstrated internal and external validation accuracies of up to 64% and 58%, respectively. Taken together, our results suggest that using ML and EEG to predict continuous pain ratings is not currently feasible. However, classification models demonstrate some potential, consistently outperforming chance across validation samples. Further improvements, such as composite measures, are required to elevate ML performance to a clinically meaningful level.
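
The two reference models named above can be illustrated with a minimal sketch. It is not the study's pipeline: the ratings are synthetic draws on a 0 to 100 scale, the uniform random predictor is an assumed form of a "random prediction model", and all function names are hypothetical. It shows why a dummy model that always predicts the training-set mean is a much stricter benchmark than a random predictor when both are scored by MAE.

```python
import random
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted ratings."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def dummy_mean_predictions(train_ratings, n_test):
    """Baseline without neural data: always predict the training-set mean."""
    baseline = mean(train_ratings)
    return [baseline] * n_test


def random_predictions(n_test, rng, low=0, high=100):
    """Assumed random model: one uniform draw on the rating scale per trial."""
    return [rng.uniform(low, high) for _ in range(n_test)]


rng = random.Random(0)

# Synthetic ratings clipped to 0-100 (illustrative only, not study data).
train_ratings = [min(100, max(0, rng.gauss(50, 25))) for _ in range(2000)]
test_ratings = [min(100, max(0, rng.gauss(50, 25))) for _ in range(2000)]

mae_dummy = mean_absolute_error(
    test_ratings, dummy_mean_predictions(train_ratings, len(test_ratings))
)
mae_random = mean_absolute_error(
    test_ratings, random_predictions(len(test_ratings), rng)
)

print(f"dummy (training mean) MAE: {mae_dummy:.2f}")
print(f"random prediction MAE:     {mae_random:.2f}")
```

A model that uses EEG features would need an MAE below the dummy value, not merely below the random value, to show that the neural data contribute predictive information.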