Automatically detecting mental states such as stress from video images of the face could support the evaluation of stress responses in applicants for high-risk jobs or contribute to timely stress detection in challenging operational settings (e.g., aircrew, command center operators). Challenges in automatically estimating mental state include the generalization of models across contexts and across participants. Here, we aim to create robust models by training them on data from different contexts and by including physiological features. Fifty-one participants were exposed to different types of stressors (cognitive, social-evaluative and startle) and to baseline variants of these stressors. Video, electrocardiogram (ECG), electrodermal activity (EDA) and self-reports (arousal and valence) were recorded. Logistic regression models were trained to classify high versus low arousal and valence across participants, where “high” and “low” were defined relative to the center of the rating scale. We evaluated the accuracy of models trained and tested within a specific context (either a baseline or a stressor variant of a task), an intermediate context (baseline and stressor variants of a task), or a general context (all conditions together). Each of these model variants was built using only the video data, only the physiological data, or both video and physiological data. We found that all (video, physiological and video-physio) models could successfully distinguish between high- and low-rated arousal and valence, though performance tended to be better for (1) arousal than valence, (2) specific contexts than intermediate and general contexts, and (3) video-physio data than video or physiological data alone. Automatic feature selection resulted in the inclusion of 3–20 features, where the models based on video-physio data usually included features from video, ECG and EDA. Still, the performance of video-only models approached that of video-physio models.
Arousal and valence ratings by three experienced human observers, based on part of the video data, did not match the self-reports. In sum, we showed that it is possible to automatically monitor arousal and valence even in relatively general contexts, and to do so better than humans can (in the given circumstances), and that non-contact video images of faces capture an important part of the information, which has practical advantages.
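
To illustrate the across-participant classification described above, the following minimal sketch binarizes ratings relative to the scale center, fits a logistic regression, and tests it on a held-out participant. It is an illustration under stated assumptions, not the pipeline used in this study: the 1-9 rating scale, the treatment of center ratings as "low", the two synthetic features, the plain gradient-descent fit, and all function names (`binarize`, `fit_logistic`, `leave_one_participant_out`, `make_synthetic`) are hypothetical, and automatic feature selection is omitted.

```python
import math
import random


def binarize(rating, center):
    """Label a rating 'high' (1) if above the scale center, else 'low' (0).
    Assumption: ratings exactly at the center count as 'low'."""
    return 1 if rating > center else 0


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Unregularized logistic regression fitted by batch gradient descent."""
    n_feat = len(X[0])
    w, b = [0.0] * n_feat, 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() in range
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(n_feat):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Predict 'high' (1) when the linear score is positive."""
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0


def leave_one_participant_out(samples):
    """samples: list of (participant_id, features, label).
    Returns one accuracy per held-out participant."""
    accuracies = []
    for held_out in sorted({pid for pid, _, _ in samples}):
        train = [(x, y) for pid, x, y in samples if pid != held_out]
        test = [(x, y) for pid, x, y in samples if pid == held_out]
        w, b = fit_logistic([x for x, _ in train], [y for _, y in train])
        correct = sum(predict(w, b, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return accuracies


def make_synthetic(n_participants=6, n_trials=10, seed=0):
    """Synthetic stand-in data: one 'video' and one 'physio' feature that
    both track a hypothetical 1-9 self-reported rating, plus noise."""
    rng = random.Random(seed)
    samples = []
    for pid in range(n_participants):
        for _ in range(n_trials):
            rating = rng.randint(1, 9)
            signal = (rating - 5) / 4.0
            features = [signal + rng.gauss(0, 0.5), signal + rng.gauss(0, 0.5)]
            samples.append((pid, features, binarize(rating, 5)))
    return samples


if __name__ == "__main__":
    accs = leave_one_participant_out(make_synthetic())
    print("mean held-out accuracy:", sum(accs) / len(accs))
```

Holding out all trials of one participant at a time keeps that participant's data out of training, which is what "across participants" requires. A specific, intermediate, or general context model would differ only in which conditions' trials are passed in as `samples`.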