Maize is a crop that is widely cultivated all over the world, so the classification of maize seed quality is important; however, traditional methods based on texture, shape, and color require repetitive manual work and are inefficient. Recently, deep learning has achieved remarkable results in image processing, and deep convolutional neural networks (DCNNs) are often used for image classification tasks. Here, we explored another neural network, the Vision Transformer (ViT), which was originally applied to natural language processing. Based on the self-attention mechanism, ViT discards the convolutional structure, but when trained from scratch on medium-sized datasets it performs poorly compared with CNNs, because the simple tokenization in the original ViT fails to capture the local structure within the input image and thus cannot produce effective training representations. We therefore proposed an improved ViT model, SeedViT. Whereas the original ViT can only be trained effectively on large datasets, SeedViT can be trained on small and medium datasets, achieving state-of-the-art (SOTA) performance in vision classification with only 2,500 images in our study. The feasibility of SeedViT for classifying maize seed quality was studied in this article, and it was compared with a DCNN and traditional machine learning algorithms. The accuracy, sensitivity, specificity, and precision were 97.6%, 94.1%, 98.9%, and 97%, respectively. In addition, we employed ViT and VGG (Visual Geometry Group, a convolutional neural network) to extract image features and used a support vector machine (SVM) as the classifier; ViT-SVM was stable at around 96.6% accuracy on the test set, while VGG-SVM was stable at around 94.6%. Finally, a visual attention map was generated by visualization techniques. These results show that SeedViT can be a novel approach for maize seed manufacturing.
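As a rough illustration of the feature-extraction-plus-SVM comparison described above, the following is a minimal sketch that uses a frozen, pretrained ViT as a feature extractor and an SVM as the classifier. The backbone name, the timm and scikit-learn APIs, the folder paths, and all hyperparameters are illustrative assumptions, not the exact configuration used in the study.

import numpy as np
import timm
import torch
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# num_classes=0 makes timm return pooled features instead of class logits.
vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
vit.eval().to(device)

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])
# "data/train" and "data/test" are hypothetical folders with one subfolder per seed class.
train_loader = DataLoader(datasets.ImageFolder("data/train", tfm), batch_size=32)
test_loader = DataLoader(datasets.ImageFolder("data/test", tfm), batch_size=32)

@torch.no_grad()
def extract_features(loader):
    # Run the frozen ViT over every batch and collect pooled feature vectors.
    feats, labels = [], []
    for x, y in loader:
        feats.append(vit(x.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

X_train, y_train = extract_features(train_loader)
X_test, y_test = extract_features(test_loader)

svm = SVC(kernel="rbf", C=1.0)  # SVM classifier on top of the ViT features
svm.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, svm.predict(X_test)))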
Practical applications
An algorithm for classifying maize seeds and sorting out high-quality ones.
Uses the Transformer architecture from natural language processing in place of a convolutional neural network.
Uses the GELU activation function and a soft split of the input images to improve the Vision Transformer model so that it achieves strong performance on small and medium datasets (a minimal sketch follows below).
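The two modifications named in the last item can be sketched as follows: a soft split produces overlapping patches (implemented here with torch.nn.Unfold) so that neighboring tokens share pixels and local structure is preserved, unlike ViT's hard, non-overlapping split, and the feed-forward block uses GELU activation. The embedding dimension, kernel, stride, padding, and hidden size below are illustrative assumptions, not the exact SeedViT configuration.

import torch
import torch.nn as nn

class SoftSplitEmbed(nn.Module):
    # Split an image into overlapping patches (soft split) and project them to tokens.
    def __init__(self, in_chans=3, embed_dim=256, kernel=7, stride=4, padding=2):
        super().__init__()
        # stride < kernel, so neighboring patches overlap and share local structure.
        self.unfold = nn.Unfold(kernel_size=kernel, stride=stride, padding=padding)
        self.proj = nn.Linear(in_chans * kernel * kernel, embed_dim)

    def forward(self, x):                        # x: (B, C, H, W)
        tokens = self.unfold(x).transpose(1, 2)  # (B, num_patches, C*k*k)
        return self.proj(tokens)                 # (B, num_patches, embed_dim)

class GELUMlp(nn.Module):
    # Transformer feed-forward block using GELU activation.
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

x = torch.randn(2, 3, 224, 224)  # a dummy batch of seed images
tokens = SoftSplitEmbed()(x)
out = GELUMlp()(tokens)
print(tokens.shape, out.shape)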