This study explored the behavioral and neural activity characteristics of audiovisual temporal integration in motion perception from both implicit and explicit perspectives. The streaming-bouncing bistable paradigm (SB task) was employed to investigate implicit temporal integration, while the corresponding simultaneity judgment task (SJ task) was used to examine explicit temporal integration. The behavioral results revealed a negative correlation between implicit and explicit temporal processing. In the ERP results of both tasks, three neural phases (PD100, ND180, and PD290) in the fronto-central region were identified as reflecting integration effects and the auditory-evoked multisensory N1 component may serve as a primary component responsible for cross-modal temporal processing. However, there were significant differences between the VA ERPs in the SB and SJ tasks and the influence of speed on implicit and explicit integration effects also varied. The aforementioned results, building upon the validation of previous temporal renormalization theory, suggest that implicit and explicit temporal integration operate under distinct processing modes within a shared neural network. This underscores the brain’s flexibility and adaptability in cross-modal temporal processing.