The ocean floor, its species and habitats are under pressure from various human activities. Marine spatial planning and nature conservation aim to address these threats but require sufficiently detailed and accurate maps of the distribution of seabed substrates and habitats. Benthic habitat mapping has markedly evolved as a discipline over the last decade, but important challenges remain. To test the adequacy of current data products and classification approaches, we carried out a comparative study based on a common dataset of multibeam echosounder bathymetry and backscatter data, supplemented with ground-truth observations. The task was to predict the spatial distribution of five substrate classes (coarse sediments, mixed sediments, mud, sand, and rock) in a highly heterogeneous area of the south-western continental shelf of the United Kingdom. Five different supervised classification methods were employed, and their accuracy was estimated with a set of withheld samples. We found that all methods achieved overall accuracies of around 50%. Errors of commission and omission were acceptable for rocky substrates, but high for all sediment types. We attribute the low map accuracy, which occurred regardless of mapping approach, predominantly to three factors: inadequacies of the selected classification system, which is required to fit gradually changing substrate types into a rigid scheme; low discriminatory power of the available predictors; and high spatial complexity of the site relative to the positioning accuracy of the ground-truth equipment. Some of these issues might be alleviated by creating an ensemble map that aggregates the individual outputs into one map showing the modal substrate class and its associated confidence, or by adopting a quantitative approach that models the spatial distribution of sediment fractions. We conclude that further incremental improvements to the collection, processing and analysis of remote sensing and sample data are required to improve map accuracy.
To assess progress in benthic habitat mapping, we propose the creation of benchmark datasets.
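
As an illustration only, the ensemble idea and the accuracy measures mentioned above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used in the study: the function names `modal_ensemble` and `error_rates`, the short class labels, and the definition of confidence as the share of classifiers agreeing with the modal class are all hypothetical choices made for the example.

```python
from collections import Counter

# Short labels for the five substrate classes (hypothetical naming).
CLASSES = ["coarse", "mixed", "mud", "sand", "rock"]


def modal_ensemble(predictions):
    """Combine per-classifier label lists into one modal map.

    predictions: list of lists, one list of cell labels per classifier,
    all of equal length. Returns (modal_labels, confidence), where
    confidence is the fraction of classifiers voting for the modal class
    (an assumed definition). Ties go to the label encountered first.
    """
    n_models = len(predictions)
    modal, confidence = [], []
    for cell_labels in zip(*predictions):
        label, votes = Counter(cell_labels).most_common(1)[0]
        modal.append(label)
        confidence.append(votes / n_models)
    return modal, confidence


def error_rates(reference, predicted, classes=CLASSES):
    """Overall accuracy plus per-class (commission, omission) errors.

    Commission error: share of cells predicted as a class that are wrong.
    Omission error: share of reference cells of a class that were missed.
    A rate is None when its denominator is zero.
    """
    pairs = list(zip(reference, predicted))
    overall = sum(r == p for r, p in pairs) / len(pairs)
    per_class = {}
    for c in classes:
        hits = sum(r == c and p == c for r, p in pairs)
        n_ref = sum(r == c for r, _ in pairs)
        n_pred = sum(p == c for _, p in pairs)
        commission = 1 - hits / n_pred if n_pred else None
        omission = 1 - hits / n_ref if n_ref else None
        per_class[c] = (commission, omission)
    return overall, per_class
```

In such a sketch, the confidence layer would flag cells where the classifiers disagree, and `error_rates` would be applied to the withheld samples only, so that the accuracy estimate stays independent of the training data.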