Multimodal fusion is an effective approach for exploiting the complementary information shared across multiple imaging modalities to better understand brain diseases. However, most current fusion approaches are blind: they do not incorporate any prior information. There is increasing interest in uncovering how specific behavioral measures map onto rich brain imaging data; hence, a supervised, goal-directed model that uses a priori information as a reference to guide multimodal data fusion is both needed and a natural choice. Here we propose a fusion-with-reference model, called “multi-set canonical correlation analysis with reference plus joint independent component analysis” (MCCAR+jICA), which can precisely identify co-varying multimodal imaging patterns closely related to reference information, such as cognitive scores. In a 3-way fusion simulation, the proposed method was compared with alternative methods on the estimation accuracy of both target component decomposition and modality linkage detection; MCCAR+jICA achieved higher precision than the alternatives on both. In human imaging data, working memory performance was used as the reference to investigate co-varying functional and structural brain patterns across 3 modalities (fMRI, sMRI, and dMRI) and how they are impaired in schizophrenia. Two independent cohorts (294 and 83 subjects, respectively) were used. Similar brain maps were identified in the two cohorts, with substantial overlap in the executive control networks in fMRI, the salience network in sMRI, and major white matter tracts in dMRI. These regions have been linked with working memory deficits in schizophrenia in multiple reports; MCCAR+jICA further verified them in a repeatable, joint manner, demonstrating the method's potential for identifying neuromarkers of mental disorders.