Summary

We investigated arousal scoring agreement within full‐night polysomnography in a multi‐centre setting. Ten expert scorers from seven centres annotated 50 polysomnograms using the American Academy of Sleep Medicine guidelines. The agreement between arousal indexes (ArIs) was investigated using intraclass correlation coefficients (ICCs). Moreover, kappa statistics were used to evaluate the second‐by‐second agreement in whole recordings and in different sleep stages. Finally, arousal clusters, that is, periods with overlapping arousals by multiple scorers, were extracted. The overall similarity of the ArIs was fair (ICC = 0.41), varying from poor to excellent between the scorer pairs (ICC = 0.04–0.88). ArI similarity was higher for respiratory (ICC = 0.65) than for spontaneous (ICC = 0.23) arousals. The overall second‐by‐second agreement was fair (Fleiss’ kappa = 0.40), varying from poor to substantial depending on the scorer pair (Cohen’s kappa = 0.07–0.68). Fleiss’ kappa increased from light to deep sleep (0.45, 0.45, and 0.53 for stages N1, N2, and N3, respectively), was moderate in the rapid eye movement stage (0.48), and was lowest in the wake stage (0.25). More than half of the arousal clusters were scored by only one or two scorers, and fewer than a third by at least five scorers. In conclusion, the scoring agreement varied depending on the arousal type, sleep stage, and scorer pair, but was overall relatively low. The most uncertain areas were related to spontaneous arousals and arousals scored in the wake stage. These results indicate that manual arousal scoring is generally not reliable, and that changes are needed in the assessment of sleep fragmentation for clinical and research purposes.
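As an illustration of the second‐by‐second agreement statistics mentioned above, the following is a minimal Python sketch, not the study's actual analysis code. It assumes each scorer's arousals are available as (start, end) intervals in whole seconds, and that every second is labelled either "arousal" (1) or "no arousal" (0). The function names `events_to_mask`, `cohen_kappa`, and `fleiss_kappa` are hypothetical and chosen only for this example.

```python
def events_to_mask(events, n_seconds):
    """Convert (start, end) intervals in seconds (end exclusive) to a 0/1 label per second."""
    mask = [0] * n_seconds
    for start, end in events:
        for t in range(max(0, start), min(n_seconds, end)):
            mask[t] = 1
    return mask


def cohen_kappa(a, b):
    """Cohen's kappa for two binary label sequences of equal length."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    pa1, pb1 = sum(a) / n, sum(b) / n
    # Chance agreement: both say 1, or both say 0, under independence.
    p_expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    # Undefined (division by zero) if both scorers use a single label throughout.
    return (p_observed - p_expected) / (1 - p_expected)


def fleiss_kappa(masks):
    """Fleiss' kappa for several scorers, each given as a binary label sequence."""
    n_raters = len(masks)
    n_items = len(masks[0])
    # Number of scorers who marked an arousal in each second.
    ones = [sum(column) for column in zip(*masks)]
    # Per-second agreement: fraction of scorer pairs that agree.
    p_items = [
        (k * k + (n_raters - k) ** 2 - n_raters) / (n_raters * (n_raters - 1))
        for k in ones
    ]
    p_bar = sum(p_items) / n_items
    p1 = sum(ones) / (n_items * n_raters)
    p_expected = p1 ** 2 + (1 - p1) ** 2
    # Undefined (division by zero) if every scorer uses a single label throughout.
    return (p_bar - p_expected) / (1 - p_expected)


# Toy example: three hypothetical scorers over a 10-second excerpt.
scorer_a = events_to_mask([(2, 6)], 10)
scorer_b = events_to_mask([(3, 6)], 10)
scorer_c = events_to_mask([(2, 5), (8, 9)], 10)
pairwise = cohen_kappa(scorer_a, scorer_b)
overall = fleiss_kappa([scorer_a, scorer_b, scorer_c])
```

Restricting the masks to the seconds belonging to one sleep stage before calling these functions would give stage-specific values of the kind reported above, under the same assumptions.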