Deep convolutional neural networks are highly effective for computer vision tasks when large amounts of training data are available, but they still struggle with small training datasets. We therefore need a training pipeline that copes with rare object types and an overall lack of training data in order to build well-performing models with stable predictions. This paper presents XtremeAugment, a comprehensive framework that provides an easy, reliable, and scalable way to collect image datasets and to label and augment the collected data efficiently. The framework consists of two augmentation techniques, Hardware Dataset Augmentation (HDA) and Object-Based Augmentation (OBA), which can be used independently and complement each other when combined. HDA makes it possible to collect more data while spending less time on manual labeling. OBA significantly increases the variability of the training data while keeping the distribution of the augmented images close to that of the original dataset. We evaluate the proposed approach and its individual components on the apple spoilage segmentation problem. Our results show a substantial increase in model accuracy: the model reaches an F1-score of 0.91 and outperforms the baseline by up to 0.62 F1-score in a few-shot learning case on in-the-wild data. XtremeAugment yields the greatest benefit when images are collected in a controlled indoor environment but the model must be used in the wild.