The indispensability of visual working memory (VWM) in daily life suggests its importance in higher cognitive functions and neurological diseases. However, despite extensive research efforts, most findings on the neural basis of VWM are limited to a unimodal context (either structure or function) and show limited generalizability. To address these issues, this study combined multimodal neuroimaging with machine learning to reveal the neural mechanism of VWM in a large cohort (N = 547). Specifically, multimodal magnetic resonance imaging features extracted from voxel-wise amplitude of low-frequency fluctuations, gray matter volume, and fractional anisotropy were used to build an individual VWM capacity prediction model through a machine learning pipeline comprising feature selection, relevance vector regression, cross-validation, and model fusion. The resulting model exhibited promising predictive performance on VWM (r = .402, p < .001) and identified features within the subcortical-cerebellum network, default mode network, motor network, corpus callosum, anterior corona radiata, and external capsule as significant predictors. The main results were then compared with those obtained for emotion regulation and fluid intelligence using the same pipeline, confirming the specificity of our findings. Moreover, the main results held up well under different cross-validation regimes and preprocessing strategies. These findings, while providing richer evidence for the importance of multimodality in understanding cognitive functions, offer a solid and general foundation for comprehensively understanding the VWM process from the top down.
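
To make the pipeline concrete, the sketch below illustrates one plausible reading of the described steps (per-modality feature selection, relevance-vector-style regression, cross-validation, and late fusion). It is a minimal sketch, not the authors' implementation: the feature counts, fold count, fusion rule, and use of ARDRegression (a scikit-learn stand-in with a relevance-vector-like ARD prior, since scikit-learn ships no native RVR) are all assumptions, and the data are synthetic placeholders.

```python
# Minimal sketch of the described pipeline: per-modality feature selection,
# relevance-vector-style regression, cross-validation, and late fusion by
# averaging predictions. All parameters and data here are illustrative.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ARDRegression  # RVR stand-in (ARD prior)
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_subjects = 547  # cohort size from the abstract

# Synthetic placeholders for the three modalities (ALFF, GMV, FA) and the
# behavioral VWM capacity score; real inputs would be preprocessed MRI features.
modalities = {
    "alff": rng.standard_normal((n_subjects, 1000)),
    "gmv": rng.standard_normal((n_subjects, 1000)),
    "fa": rng.standard_normal((n_subjects, 500)),
}
y = rng.standard_normal(n_subjects)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
fused_pred = np.zeros(n_subjects)

for train_idx, test_idx in cv.split(y):
    per_modality_preds = []
    for X in modalities.values():
        model = make_pipeline(
            StandardScaler(),
            SelectKBest(f_regression, k=100),  # filter-style feature selection
            ARDRegression(),
        )
        # Fit on training folds only, so selection cannot leak test data.
        model.fit(X[train_idx], y[train_idx])
        per_modality_preds.append(model.predict(X[test_idx]))
    # Model fusion: average the three modality-specific predictions.
    fused_pred[test_idx] = np.mean(per_modality_preds, axis=0)

# Predicted-vs-observed correlation, the performance metric quoted above.
r = np.corrcoef(fused_pred, y)[0, 1]
print(f"cross-validated r = {r:.3f}")
```

Averaging out-of-fold predictions is only one common late-fusion choice; the study's actual fusion weights, feature-selection thresholds, and cross-validation regimes may differ, and the robustness checks mentioned above would correspond to rerunning this loop under alternative splits and preprocessing strategies.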