The brain is subjected to multi‐modal sensory information in an environment governed by statistical dependencies. Mismatch responses (MMRs), classically recorded with EEG, have provided valuable insights into the brain's processing of regularities and the generation of corresponding sensory predictions. Only a few studies allow MMRs to be compared across multiple modalities in a simultaneous sensory stream, and the cross‐modal context sensitivity of these responses remains unknown. Here, we used a tri‐modal version of the roving stimulus paradigm in fMRI to elicit MMRs in the auditory, somatosensory and visual modalities. Participants (N = 29) were simultaneously presented with sequences of low‐ and high‐intensity stimuli in each of the three senses while actively observing the tri‐modal input stream and occasionally reporting the intensity of the previous stimulus in a prompted modality. The sequences were based on a probabilistic model, defining transition probabilities such that, for each modality, stimuli were more likely to repeat (p = .825) than change (p = .175) and stimulus intensities were equiprobable (p = .5). Moreover, each transition was conditional on the configuration of the other two modalities, so that the sequences also carried global (cross‐modal) predictive properties. We identified a shared mismatch network of modality‐general inferior frontal and temporo‐parietal areas as well as sensory areas, with the connectivity (psychophysiological interaction) between these regions modulated during mismatch processing. Further, we found deviant responses within the network to be modulated by local stimulus repetition, which suggests highly comparable processing of expectation violation across modalities. Moreover, hierarchically higher regions of the mismatch network in the temporo‐parietal area around the intraparietal sulcus were found to signal cross‐modal expectation violation.
With the consistency of MMRs across audition, somatosensation and vision, our study provides insights into a shared cortical network of uni‐ and multi‐modal expectation violation in response to sequence regularities.
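As an illustration of the sequence statistics stated above, the following is a minimal sketch of a single modality's stimulus stream, assuming it can be approximated marginally as a symmetric two‐state first‐order Markov chain with the reported repeat probability (p = .825). This is a simplification and not the generative model used in the study: it ignores the conditioning of each transition on the other two modalities. The function names and the 0/1 coding of low/high intensity are hypothetical.

```python
import random

P_REPEAT = 0.825  # repeat probability reported in the abstract
P_CHANGE = 0.175  # change probability reported in the abstract


def simulate_marginal_sequence(n_stimuli, p_repeat=P_REPEAT, seed=0):
    """Simulate one modality as a symmetric two-state Markov chain.

    Assumption: 0 codes low intensity, 1 codes high intensity, and the
    cross-modal conditioning described in the abstract is ignored.
    """
    if n_stimuli < 1:
        raise ValueError("n_stimuli must be at least 1")
    rng = random.Random(seed)
    seq = [rng.choice((0, 1))]  # equiprobable starting intensity
    for _ in range(n_stimuli - 1):
        prev = seq[-1]
        # Repeat the previous intensity with probability p_repeat, else switch.
        seq.append(prev if rng.random() < p_repeat else 1 - prev)
    return seq


def stationary_high_probability(p_repeat=P_REPEAT):
    """Stationary probability of the high-intensity state.

    For a two-state chain that leaves either state with the same
    probability, this is p_change / (p_change + p_change) = 0.5.
    """
    p_change = 1.0 - p_repeat
    return p_change / (p_change + p_change)


def expected_run_length(p_repeat=P_REPEAT):
    """Mean number of identical stimuli in a row (geometric distribution)."""
    return 1.0 / (1.0 - p_repeat)


def empirical_change_rate(seq):
    """Fraction of transitions in which the intensity changes."""
    changes = sum(a != b for a, b in zip(seq, seq[1:]))
    return changes / (len(seq) - 1)
```

Under this simplified chain, equal switching probabilities in both directions yield the equiprobable intensities (p = .5) mentioned above, and the mean length of a stimulus train before a change is 1/.175, or about 5.7 stimuli.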