To further extend the applicability of wearable sensors, methods for accurately extracting subtle psychological information from sensor data are required. However, accessing subjective information in everyday life, such as cognitive load, remains challenging. To build consensus on methods for cognitive load monitoring, a machine learning challenge was organized. The participants developed machine learning methods for cognitive load classification using data from wrist-worn physiological sensors, namely heart rate, R-R intervals, skin conductance, and skin temperature. The challenge used data from subjects solving cognitive tasks of varying difficulty. This article presents a systematic comparison and multistrategic performance evaluation of the thirteen methods submitted to the challenge. The comparison covers preprocessing, classifiers, and implementation techniques, and the evaluation examines performance variations across task difficulty levels, subjects, and experiment periods. The results indicate that the most robust methods used multimodal sensor data, classical classification approaches such as decision trees and support vector machines or their ensembles, and Bayesian optimization for hyperparameter tuning. The most accurate models used handcrafted features that were further selected using sequential backward floating search and evaluated using a stratified, person-aware cross-validation strategy. Moreover, the results indicated that classification performance was better for specific test subjects and for the tasks with the highest difficulty, and that in some cases it depended on the time elapsed since the start of the experiment. This dependency is likely due to model overfitting or to the subjective nature of the psychophysiological process. The intersubject variability in responses is difficult to capture through objective binary labels for cognitive load, which warrants more sophisticated annotation approaches.
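The stratified, person-aware cross-validation mentioned above can be illustrated with a minimal sketch. It is not the implementation used by any challenge participant: the function name `person_aware_stratified_folds` and the greedy balancing rule are assumptions made for illustration. The idea is that all samples from one subject land in the same fold (person-aware), while subjects are distributed so that class counts stay as even as possible across folds (stratified).

```python
from collections import Counter, defaultdict


def person_aware_stratified_folds(labels, subjects, n_folds=3):
    """Hypothetical helper: assign each subject to exactly one fold,
    greedily keeping per-class sample counts even across folds.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    # Count how many samples of each class every subject contributes.
    per_subject = defaultdict(Counter)
    for label, subject in zip(labels, subjects):
        per_subject[subject][label] += 1

    classes = sorted(set(labels))
    fold_counts = [Counter() for _ in range(n_folds)]
    fold_of = {}

    # Place subjects with the most samples first.
    order = sorted(
        per_subject,
        key=lambda s: (-sum(per_subject[s].values()), str(s)),
    )
    for subject in order:
        def imbalance(fold):
            # Spread of per-class totals if `subject` joined `fold`.
            score = 0
            for c in classes:
                totals = [
                    fold_counts[f][c]
                    + (per_subject[subject][c] if f == fold else 0)
                    for f in range(n_folds)
                ]
                score += max(totals) - min(totals)
            return score

        best = min(range(n_folds), key=imbalance)
        fold_of[subject] = best
        fold_counts[best].update(per_subject[subject])

    # Build index lists; a subject's samples are never split between
    # the training and test side of the same fold.
    splits = []
    for fold in range(n_folds):
        test = [i for i, s in enumerate(subjects) if fold_of[s] == fold]
        train = [i for i, s in enumerate(subjects) if fold_of[s] != fold]
        splits.append((train, test))
    return splits
```

Keeping each subject on one side of the split prevents a model from being scored on a person it has already seen during training, which would otherwise inflate the estimated accuracy. Library routines such as scikit-learn's `StratifiedGroupKFold` serve the same purpose and are usually preferable to a hand-written version.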