Introduction: Adequate head and neck (HN) organ-at-risk (OAR) delineation is crucial for HN radiotherapy and for investigating the relationships between radiation dose to OARs and radiation-induced side effects. The automatic contouring algorithms currently in clinical use, such as atlas-based contouring (ABAS), leave room for improvement. The aim of this study was to use a comprehensive evaluation methodology to investigate the performance of HN OAR auto-contouring with deep learning contouring (DLC), compared to ABAS.
Methods: The DLC neural network was trained on 589 HN cancer patients. DLC was compared to ABAS by applying each method to an independent validation cohort of 104 patients, who had also been manually contoured. For each of the 22 OAR contours (glandular, upper digestive tract and central nervous system (CNS)-related structures), the Dice similarity coefficient (DICE) and the absolute mean and max dose differences (|Dmean dose| and |Dmax dose|) were obtained as performance measures. For a subset of 7 OARs, contouring time, inter-observer variation and subjective judgement were also evaluated.
Results: DLC resulted in equal or significantly improved quantitative performance measures in 19 out of 22 OARs, compared to ABAS (DICE/|Dmean dose|/|Dmax dose|: 0.59/4.2/4.1 Gy (ABAS); 0.74/1.1/0.8 Gy (DLC)). The improvements were mainly for the glandular and upper digestive tract OARs. DLC significantly reduced the delineation time for the inexperienced observer. The subjective evaluation showed that DLC contours were more often preferred over the ABAS contours overall, were considered more precise, and were more often confused with manual contours. Manual contours still outperformed both DLC and ABAS; however, DLC results were within or bordering the inter-observer variability for the manually edited contours in this cohort.
Conclusion: The DLC, trained on a large HN cancer patient cohort, outperformed the ABAS for the majority of HN OARs.
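For illustration only, the three quantitative measures named in the Methods could be computed along the following lines. This is a minimal sketch, not the study's implementation: it assumes each contour is available as a boolean voxel mask on the same grid as the dose distribution, that |Dmean dose| and |Dmax dose| are the absolute differences in mean and maximum dose between the automatic and the manual contour, and the function names `dice` and `abs_dose_differences` are hypothetical.

```python
import numpy as np


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two boolean voxel masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / total)


def abs_dose_differences(dose, auto_mask, manual_mask):
    """Absolute mean and max dose differences (same unit as `dose`, e.g. Gy).

    Assumed definition: the mean (or max) dose inside the automatic contour
    minus the mean (or max) dose inside the manual contour, in absolute value.
    """
    dose = np.asarray(dose, dtype=float)
    auto = np.asarray(auto_mask, dtype=bool)
    manual = np.asarray(manual_mask, dtype=bool)
    if not auto.any() or not manual.any():
        raise ValueError("both contours must contain at least one voxel")
    d_mean = abs(dose[auto].mean() - dose[manual].mean())
    d_max = abs(dose[auto].max() - dose[manual].max())
    return float(d_mean), float(d_max)


# Toy 2D example with made-up values, purely to show the call pattern.
manual = np.array([[1, 1, 0], [0, 0, 0]], dtype=bool)
auto = np.array([[0, 1, 1], [0, 0, 0]], dtype=bool)
dose = np.array([[10.0, 20.0, 40.0], [0.0, 0.0, 0.0]])

overlap = dice(auto, manual)
mean_diff, max_diff = abs_dose_differences(dose, auto, manual)
```

In practice the masks would come from rasterised contours for each OAR, and the measures would be summarised per OAR across the validation cohort.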