Many studies emphasize the need to verbally represent pictorial metaphors, but few have empirically investigated whether and how particular verbalization forms match different types of pictorial metaphors. Using event-related potentials (ERPs), a 3 (pictorial structure: fusion, juxtaposition, literal image) × 2 [verbalization form: A是(is) B, A像(is like) B] within-subjects experiment was conducted with 36 participants. ERPs were time-locked to the onset of the verb [是/像(is/is like)] of the metaphor sentence that followed a pictorial metaphor, in order to detect verbo-pictorial incongruity in metaphor comprehension. The incongruity-based ERP analysis showed that pictorial metaphors verbalized in either form induced a frontal N1 effect regardless of pictorial structure, with a larger N1 amplitude only for literal images in “A是(is) B.” A stronger central P2 was observed in “A像(is like) B” for all three structures. Although a posterior P3 was elicited in all conditions, a larger P3 was found for juxtapositions verbalized in “A像(is like) B” and for literal images verbalized in “A是(is) B.” There was no significant difference between the two verbalization forms for fusion-structured pictorial metaphors. These findings suggest that (1) verbo-pictorial metaphors can induce incongruity-based attention; (2) “A像(is like) B” is the more effective verbalization form for representing pictorial metaphors, specifically juxtaposition-structured ones, as indicated by the higher verbo-pictorial semantic congruity and relatedness indexed by the stronger P2 and P3; and (3) for non-metaphorical advertising pictures, verbal metaphors show an interference effect. The study not only reveals the neuro-cognitive mechanism of processing verbo-pictorial metaphors but also offers a neural reference for designing effective multimodal metaphors by finding an optimal match between pictorial metaphors and verbalization forms.