Performance during perceptual decision-making exhibits an inverted-U relationship with arousal, but the underlying network mechanisms remain unclear. Here, we recorded from auditory cortex (A1) of behaving mice during passive tone presentation, while tracking arousal via pupillometry. We found that tone discriminability in A1 ensembles was optimal at intermediate arousal, revealing a population-level neural correlate of the inverted-U relationship. We explained this arousal-dependent coding using a spiking network model with a clustered architecture. Specifically, we show that optimal stimulus discriminability is achieved near a transition between a multi-attractor phase with metastable cluster dynamics (low arousal) and a single-attractor phase (high arousal). Additional signatures of this transition include arousal-induced reductions of overall neural variability and the extent of stimulus-induced variability quenching, which we observed in the empirical data. Our results elucidate computational principles underlying interactions between pupil-linked arousal, sensory processing, and neural variability, and suggest a role for phase transitions in explaining nonlinear modulations of cortical computations.