Humans share with a variety of animal species the spontaneous ability to detect the numerical correspondence between limited quantities of visual objects and discrete auditory events. Here, we explored how such a mental representation is generated in the visual modality by monitoring a parieto-occipital ERP component, the N2pc, whose amplitude covaries with the number of visual targets during explicit enumeration. Participants listened to an auditory sequence of one to three tones, followed by a visual search display containing one to three targets. In Experiment 1, participants responded on the basis of the numerical correspondence between tones and visual targets. In Experiment 2, participants were asked to ignore the tones and to detect the presence of a target in the search display. The results of Experiment 1 showed an increase in N2pc amplitude with the number of visual targets, followed by a centroparietal ERP component modulated by the numerical correspondence between tones and visual targets. The results of Experiment 2 showed no increase in N2pc amplitude as a function of the number of visual targets; however, the numerical correspondence between tones and visual targets did influence N2pc amplitude. A comparison of a subset of amplitude and latency parameters between Experiments 1 and 2 suggests that the N2pc reflects two modes of representing the number of visual targets. One mode, susceptible to voluntary control, relies on visual target segregation for exact target individuation, whereas the other, likely enabling spontaneous cross-modal matching, relies on the extraction of coarse information about the number of targets from the visual input.