Abstract. Knowledge of plant species distributions is essential for various applications, such as nature conservation, agriculture, and forestry. Remote sensing data, especially high-resolution orthoimages from Unoccupied Aerial Vehicles (UAVs), have been demonstrated to be an effective data source for plant species mapping. In particular, in concert with novel pattern recognition methods, such as Convolutional Neural Networks (CNNs), plant species can be accurately segmented in such high-resolution UAV images. However, training pattern recognition models for species segmentation that are transferable across various landscapes and remote sensing data characteristics often requires extensive training data. Training data are usually derived in the form of segmentation masks from field surveys or from visual interpretation of the target species in remote sensing images, but both methods are laborious and constrain the training of transferable pattern recognition models. Alternatively, pattern recognition models could be trained on the open knowledge of plant appearance available from smartphone-based species identification apps, that is, millions of citizen science smartphone photographs and their corresponding species labels. However, these pairs of citizen science photographs and simple species labels (one label for the entire image) cannot be used directly to train the state-of-the-art segmentation models applied in UAV image analysis, which require per-pixel labels (also called masks) for training. Here, we overcome the limitation of simple labels in citizen science plant observations with a two-step approach: In the first step, we train CNN-based image classification models using the simple labels and apply them in a moving-window approach over UAV orthoimagery to create segmentation masks. In the second step, these segmentation masks are used to train state-of-the-art CNN-based image segmentation models with an encoder-decoder structure. We tested the approach on UAV orthoimages acquired in summer and autumn at a test site comprising ten temperate deciduous tree species in varying mixtures. Several tree species could be mapped with surprising accuracy (mean F1-score = 0.47). In homogeneous species assemblages, the accuracy increased considerably (mean F1-score = 0.55). The results indicate that many tree species can be mapped without generating training data, by integrating pre-existing knowledge from citizen science. Moreover, our analysis revealed that the variability of citizen science photographs in acquisition date and context facilitates the generation of models that are transferable across the vegetation season. Citizen science data may therefore greatly advance our capacity to monitor hundreds of plant species, and thus Earth's biodiversity, across space and time.
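To make the first step of the two-step approach concrete, the sketch below shows one way the moving-window mask generation could be implemented in PyTorch. It is a minimal illustration, not the authors' published pipeline: the classifier `model`, the window size, the stride, and the class count are all illustrative assumptions.

```python
# Minimal sketch of the moving-window mask generation (step one), assuming a
# PyTorch image classifier `model` trained on citizen science photographs with
# image-level species labels. Window size, stride, and n_classes are
# illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_mask(model, orthoimage, window=128, stride=64, n_classes=10):
    """Classify overlapping windows of an orthoimage tensor (C, H, W) and
    accumulate the class probabilities into a coarse per-pixel pseudo-mask."""
    model.eval()
    _, h, w = orthoimage.shape
    scores = torch.zeros(n_classes, h, w)  # accumulated class probabilities
    counts = torch.zeros(1, h, w)          # windows covering each pixel
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            patch = orthoimage[:, top:top + window, left:left + window]
            logits = model(patch.unsqueeze(0))   # (1, n_classes)
            probs = F.softmax(logits, dim=1)[0]  # (n_classes,)
            # Spread the window-level prediction over all pixels it covers.
            scores[:, top:top + window, left:left + window] += probs[:, None, None]
            counts[:, top:top + window, left:left + window] += 1
    scores /= counts.clamp(min=1)
    # Per-pixel argmax yields the segmentation mask used in step two.
    return scores.argmax(dim=0)  # (H, W) tensor of species indices
```

In the second step, pseudo-masks produced this way would serve as training targets for an encoder-decoder segmentation network, in place of masks drawn from field surveys or visual interpretation.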