“…Deep learning-based object recognition algorithms, such as convolutional neural networks (CNNs), have achieved stateof-the-art performance in object recognition tasks, and, more recently, models such as Vision Transformer (ViT) are also achieving state-of-the-art (SOTA) performance [8,9]. These deep learning-based object recognition algorithms are highly dependent on the environmental factors which affect the quality of the training data, so model performance may deteriorate due to insufficient training data, large amounts of noise, and the presence of unlearned These deep learning-based object recognition algorithms are highly dependent on the environmental factors which affect the quality of the training data, so model performance may deteriorate due to insufficient training data, large amounts of noise, and the presence of unlearned environmental factors [10,11]. Therefore, it is important to make the environmental factors and quality of training data and input data the same [12].…”