Objects in RGB and RGB-D images are typically recognized and classified by specially trained neural networks. The quality of object recognition depends on the quality of network training. Since a network cannot generalize far beyond its training set, building datasets and annotating them correctly is of particular importance. These tasks are time-consuming and can be difficult to carry out in real-world conditions, since it is not always possible to set up the required illumination and observation conditions. Highly realistic synthesized images can therefore be used as input data for deep learning. Synthesizing realistic images requires realistic models of the scene objects and of the illumination and observation conditions, including those produced by special optical devices. This alone is not enough to create a dataset, however, because thousands of images must be generated, which is hardly feasible manually. We therefore propose an automated solution that processes the scene to observe it from different angles, modifies the scene by adding, deleting, moving, or rotating individual objects, and then automatically annotates the desired scene image. As a result, not only the directly visible images of scene objects but also their reflections can be annotated. In addition to the segmented image, a segmented point cloud and a depth map image (RGB-D) are built, which helps in training neural networks that work with such data. To this end, a Python interpreter was embedded in a realistic rendering system. Scripts can perform any scene action available in the user interface, which makes it possible to control the automatic synthesis and segmentation of images. The paper presents examples of automatic dataset generation and the results of neural networks trained on the generated data.
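
To illustrate how a segmented point cloud can be derived from a depth map and a per-pixel segmentation, the following is a minimal sketch that assumes a simple pinhole camera model. The function name `depth_to_labeled_points`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the sample data are hypothetical and are not taken from the rendering system described here, whose scripting interface is not shown in this text.

```python
from typing import List, Sequence, Tuple

# One labeled point: camera-space x, y, z plus the id of the object it belongs to.
LabeledPoint = Tuple[float, float, float, int]


def depth_to_labeled_points(
    depth: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[LabeledPoint]:
    """Back-project a depth map into labeled 3D points (pinhole model, assumed).

    depth[v][u] is the distance along the optical axis at pixel (u, v);
    values <= 0 are treated as "no surface hit" and skipped.
    labels[v][u] is the object id from the segmented image at the same pixel.
    """
    points: List[LabeledPoint] = []
    for v, (depth_row, label_row) in enumerate(zip(depth, labels)):
        for u, (z, obj_id) in enumerate(zip(depth_row, label_row)):
            if z <= 0:
                continue  # background or invalid depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, obj_id))
    return points


# Hypothetical 2x2 depth map and segmentation mask, for illustration only.
depth = [[2.0, 2.0],
         [0.0, 4.0]]
labels = [[1, 1],
          [0, 2]]
cloud = depth_to_labeled_points(depth, labels, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Each resulting point carries the object id of its source pixel, so the cloud inherits the segmentation of the image without a separate annotation pass. A real pipeline would take the intrinsics from the virtual camera used for rendering rather than from hand-picked values.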