Improving the accuracy of intelligent data analysis is an important task in many application areas. Existing machine learning methods do not always provide classification accuracy sufficient for practical use. For this reason, hybrid ensemble methods of intelligent data analysis have been developed in recent years. They combine clustering and classification procedures. This approach increases the accuracy of a machine learning classifier by extending the input space of the task with the clustering results.
This paper considers the modification and improvement of this technology for small data analysis. The modification uses clustering with output at the first step of the method to increase the accuracy of the whole technology. Although this approach is highly accurate, it significantly expands the inputs of the final linear classifier, because the labels of the obtained clusters are added to the initial inputs. To avoid this shortcoming, the paper proposes an improvement that introduces a new classification procedure at the first step of the method and replaces all the initial inputs of the task with its results. In parallel, clustering is performed taking the output attribute into account, and its results are added to the output of the first-step classifier. The resulting expanded dataset has a significantly lower dimensionality than in the existing method, since it no longer contains the large number of initial features that is characteristic of biomedical engineering tasks. This reduces the training time of the method and improves its generalization properties.
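The two-step scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation. A non-iterative least-squares linear model stands in for the SGTM-based classifier, a plain NumPy k-means stands in for the clustering step, class labels are assumed to be binary (0/1), and cluster labels are one-hot encoded. For unseen vectors, the first-step prediction is assumed to replace the unknown output during cluster assignment. All function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Non-iterative least-squares linear model (stand-in for the SGTM classifier)."""
    A = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def score_linear(w, X):
    """Real-valued output of the linear model."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def assign(centers, X):
    """Index of the nearest center for every row of X."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means with random initial centers taken from the data."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(centers, X)
        for j in range(k):
            if np.any(labels == j):               # keep the old center if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def fit_two_step(X, y, k=3, seed=0):
    y = np.asarray(y, dtype=float)
    w1 = fit_linear(X, y)                         # step 1: classifier on the initial inputs
    s = score_linear(w1, X)                       # its output replaces the initial inputs
    Xy = np.column_stack([X, y])                  # clustering with output: features plus label
    centers = kmeans(Xy, k, seed=seed)
    onehot = np.eye(k)[assign(centers, Xy)]       # assumption: one-hot cluster labels
    Z = np.column_stack([s, onehot])              # low-dimensional expanded dataset
    w2 = fit_linear(Z, y)                         # step 2: final linear classifier
    return w1, centers, w2

def predict_two_step(model, X):
    w1, centers, w2 = model
    s = score_linear(w1, X)
    y_hat = (s >= 0.5).astype(float)              # assumption: stands in for the unknown output
    labels = assign(centers, np.column_stack([X, y_hat]))
    Z = np.column_stack([s, np.eye(len(centers))[labels]])
    return (score_linear(w2, Z) >= 0.5).astype(int)
```

In this sketch the final classifier receives only 1 + k inputs (the first-step output and k cluster indicators) instead of all initial features plus cluster labels, which is the dimensionality reduction the paragraph describes. Features are assumed to be scaled beforehand so that the output attribute has a comparable influence on the clustering.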
The method was modeled on a small dataset from an open repository. After preprocessing, the dataset contains only 294 vectors, each characterized by 18 attributes. Data classification was performed using a classifier based on the SGTM neural-like structure. This linear classifier provides high accuracy. In addition, it requires neither an iterative training procedure nor additional tuning of its parameters. Data clustering was performed using the k-means method, chosen for its simplicity and speed.
The optimal number of k-means clusters was sought using 4 different methods, which all gave different results. Therefore, experiments were conducted to assess how the number of clusters (from 3 to 7) affects the accuracy of all 4 algorithms of the developed technology. The accuracy of the proposed technology was established experimentally in comparison with the linear classifier and the existing hybrid method. In addition, by reducing the number of inputs of the final classifier, the developed technology shortens the training procedure compared with the basic method. This makes the proposed technology applicable to various applied problems of medical diagnostics, in particular those based on the analysis of small data.
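The text does not name the four methods used to search for the number of clusters. As an illustration only, the sketch below sweeps k from 3 to 7 and reports two widely used criteria, the within-cluster sum of squares (used in the elbow heuristic) and the mean silhouette coefficient. These two criteria are assumptions chosen for the example, not necessarily those used in the paper, and the function names are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means; returns centers, labels and the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):               # keep the old center if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d.argmin(axis=1), float(d.min(axis=1).sum())

def silhouette(X, labels):
    """Mean silhouette coefficient computed from Euclidean distances."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    ids = np.unique(labels)
    if len(ids) < 2:
        return 0.0
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue                              # a singleton cluster scores 0 by convention
        a = D[i, own].sum() / (n_own - 1)         # mean distance to the own cluster
        b = min(D[i, labels == c].mean() for c in ids if c != labels[i])
        m = max(a, b)
        scores[i] = (b - a) / m if m > 0 else 0.0
    return float(scores.mean())

def sweep_k(X, ks=range(3, 8), seed=0):
    """Evaluate both criteria for every candidate number of clusters."""
    results = {}
    for k in ks:
        _, labels, inertia = kmeans(X, k, seed=seed)
        results[k] = {"inertia": inertia, "silhouette": silhouette(X, labels)}
    return results
```

Different criteria can point to different values of k, which is consistent with the observation above that the four methods disagreed. In that situation, comparing the downstream classification accuracy for each k in the candidate range, as the experiments do, is a reasonable way to settle the choice.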
Keywords: small data approach, non-iterative training, ensemble learning, unsupervised-supervised technology, biomedical engineering.