The semantic segmentation of remotely sensed aerial imagery is nowadays an extensively explored task, concerned with determining, for each pixel in an input image, the most likely class label from a finite set of possible labels. Most previous work in the area has addressed the analysis of high-resolution modern images, although the semantic segmentation of historical grayscale aerial photos can also have important applications. Examples include supporting the production of historical road maps, or the development of dasymetric disaggregation approaches leveraging historical building footprints. Following recent work on the use of fully-convolutional neural networks for semantic segmentation, and specifically envisioning the segmentation of grayscale aerial imagery, we evaluated the performance of an adapted version of the W-Net architecture, which has achieved very good results on other types of image segmentation tasks. Our W-Net model is trained to simultaneously segment the input images and reconstruct, or predict, their colour from intermediate representations. Through experiments with distinct data sets frequently used in previous studies, we show that the proposed W-Net architecture is quite effective in colouring and segmenting the input images. The proposed approach outperforms a baseline corresponding to the U-Net model for the segmentation of both coloured and grayscale imagery, and it also outperforms some of the other recently proposed approaches when considering coloured imagery.

KEYWORDS
fully-convolutional neural networks, processing grayscale aerial photos, semantic segmentation of remotely sensed imagery, W-Net architecture

1 | INTRODUCTION

Large amounts of high-resolution remote sensing images are nowadays acquired daily through satellites and aerial vehicles, and used as base data for mapping and Earth observation activities. An intermediate step for converting these raw images into map layers in vector format is semantic image segmentation, which is nowadays an extensively explored task concerned with determining, for each pixel, the most likely class label from a finite set of possible labels, corresponding to the desired object categories to map (e.g., discriminating pixels referring to roads, buildings, or vegetation, considering inputs such as high-resolution images depicting urban areas). Despite the interest in the area and the many recent significant advancements, the task remains quite challenging for automated approaches. In the particular case of urban areas, semantic segmentation involves not only dealing with objects that may be partially obscured by cloud coverage, but also with the fact that objects in cities can be small,