Machine-learning prediction studies have shown potential to inform treatment stratification, but recent efforts to predict psychotherapy outcomes from routine clinical data have yielded only moderate prediction accuracy. Neuroimaging data have shown promise for predicting treatment outcome, but previous prediction attempts have been exploratory and relied on small clinical samples. Herein, we aimed to examine the incremental predictive value of neuroimaging data over clinical and demographic data alone (for which results were previously published), using a two-level multimodal ensemble machine-learning strategy. We used pretreatment structural and task-based fMRI data to predict virtual reality exposure therapy outcome in a bicentric sample of
N = 190
patients with spider phobia. First, eight 1st-level random forest classifications were conducted, each using a separate data modality (clinical questionnaire scores and sociodemographic data, cortical thickness and gray matter volumes, functional activation, connectivity, connectivity-derived graph metrics, and BOLD signal variance). Then, the resulting predictions were used to train a 2nd-level classifier that produced a final prediction. Neither the 2nd-level classifier nor any 1st-level classifier performed above chance level, except the 1st-level classifier trained on BOLD signal variance, which showed potential as a contributor to higher-level prediction from multiple regions across the brain (1st-level balanced
accuracy = 0.63
). Overall, neuroimaging data did not provide any incremental accuracy for treatment outcome prediction in patients with spider phobia with respect to clinical and sociodemographic data alone. Thus, we advise caution in the interpretation of prediction performances from small-scale, single-site patient samples. Larger multimodal datasets are needed to further investigate individual-level neuroimaging predictors of therapy response in anxiety disorders.
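
The two-level strategy described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, a toy nearest-centroid classifier stands in for the random forests, and all names (`CentroidClassifier`, `out_of_fold_scores`, `run_demo`) are hypothetical. It shows only the general idea: one 1st-level model per modality, out-of-fold 1st-level scores used to train a 2nd-level model, and evaluation by balanced accuracy (the mean of sensitivity and specificity, for which chance level is 0.5).

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (0/1)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    sensitivity = sum(1 for p in pos if p == 1) / len(pos)
    specificity = sum(1 for p in neg if p == 0) / len(neg)
    return (sensitivity + specificity) / 2


class CentroidClassifier:
    """Toy stand-in for a random forest (assumption, not the study's model)."""

    def fit(self, X, y):
        # Store the per-feature mean of each class.
        self.means = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.means[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def score(self, x):
        # Positive score: x is closer to the class-1 mean than to class 0.
        d0 = sum((a - b) ** 2 for a, b in zip(x, self.means[0]))
        d1 = sum((a - b) ** 2 for a, b in zip(x, self.means[1]))
        return d0 - d1

    def predict(self, X):
        return [1 if self.score(x) > 0 else 0 for x in X]


def out_of_fold_scores(X, y, k=5):
    """1st-level score for each subject from a model that never saw it."""
    scores = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        held_out = [i for i in range(len(X)) if i % k == fold]
        model = CentroidClassifier().fit([X[i] for i in train],
                                         [y[i] for i in train])
        for i in held_out:
            scores[i] = model.score(X[i])
    return scores


def run_demo(seed=0, n=120, n_train=80, n_modalities=3, n_features=4):
    rng = random.Random(seed)
    # Synthetic labels arranged so every fold and split contains both classes.
    y = [(i // 5) % 2 for i in range(n)]
    # modalities[m][i] is subject i's feature vector in modality m.
    # Only modality 0 carries signal; the others are pure noise.
    modalities = [
        [[rng.gauss(1.0 * y[i] if m == 0 else 0.0, 1.0)
          for _ in range(n_features)] for i in range(n)]
        for m in range(n_modalities)
    ]
    y_train, y_test = y[:n_train], y[n_train:]

    meta_train_cols, meta_test_cols = [], []
    for X in modalities:
        X_train, X_test = X[:n_train], X[n_train:]
        # Out-of-fold scores keep label information from leaking into level 2.
        meta_train_cols.append(out_of_fold_scores(X_train, y_train))
        model = CentroidClassifier().fit(X_train, y_train)
        meta_test_cols.append([model.score(x) for x in X_test])

    # One row per subject, one column per modality.
    meta_train = [list(row) for row in zip(*meta_train_cols)]
    meta_test = [list(row) for row in zip(*meta_test_cols)]
    second_level = CentroidClassifier().fit(meta_train, y_train)
    return balanced_accuracy(y_test, second_level.predict(meta_test))


if __name__ == "__main__":
    print("2nd-level balanced accuracy:", run_demo())
```

Balanced accuracy is used here because a classifier that always predicts the majority class scores exactly 0.5 on it regardless of class imbalance, which makes "above chance level" easy to read off.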