In today’s rapidly evolving technological landscape, human–machine interaction warrants systematic study. This research examined how pre-cue mode (visual, auditory, or tactile), stimulus mode (visual, auditory, or tactile), compatibility mapping (both compatible, BC; transverse compatible, TC; longitudinal compatible, LC; or both incompatible, BI), and stimulus onset asynchrony (200 ms or 600 ms) affect participants’ performance in complex human–machine systems. The study used a dual-task paradigm that combined a stimulus–response compatibility task with manual tracking, together with eye movement data. The findings reveal that visual pre-cues draw participants’ attention toward peripheral regions, an effect not observed when visual stimuli are presented alone. When presented with visual stimuli, participants predominantly prioritize the continuous manual tracking task, performing it with focal vision while executing the stimulus–response compatibility task with peripheral vision. In addition, average pupil diameter tends to decrease with visual pre-cues or visual stimuli but increases with auditory or tactile pre-cues or stimuli. These findings contribute to the literature on the theoretical design of complex human–machine interfaces and offer practical implications for the design of such interfaces. The paper also underscores the importance of choosing the combination of stimulus mode, pre-cue mode, and stimulus onset asynchrony that best suits the characteristics of the human–machine interaction task.