Document clustering is a technique for splitting a collection of textual content into clusters or groups. Spectral clustering is now widely used in the machine learning domain. Using a selection of text mining algorithms, the diverse features of unstructured content are captured to produce rich descriptions. The main aim of this article is to develop a novel approach to clustering unstructured text data using an enhanced natural language processing technique. The proposed model has three stages: preprocessing, feature extraction, and clustering. First, the unstructured data are preprocessed by punctuation and stop word removal, stemming, and tokenization. Then, features are extracted by word2vec with the continuous bag-of-words model and by term frequency-inverse document frequency. Next, the extracted features are grouped by hierarchical clustering, with the cut-off distance optimized by the fitness improved sensing area-based electric fish optimization (FISA-EFO). A tuned deep neural network, optimized by the same algorithm, is used to improve the clustering model. The results reveal that the model provides better clustering accuracy than other clustering techniques when handling unstructured text data.
KEYWORDS: fitness improved sensing area-based electric fish optimization, hierarchical clustering, tuned deep neural network, unstructured text data clustering
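The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the stop-word list and the suffix-stripping `crude_stem` function are toy stand-ins for real preprocessing tools, only term frequency-inverse document frequency features are computed (the word2vec features are omitted), average linkage with cosine distance is one possible choice of hierarchical clustering, and the `cutoff` value is fixed by hand rather than tuned by FISA-EFO. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "on", "to", "for"}


def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop punctuation, tokenize, remove stop words, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs_tokens):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF, so terms present in every document keep some weight.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """Cosine distance between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def hierarchical_clusters(vectors, cutoff):
    """Average-linkage agglomerative clustering that stops merging once the
    closest pair of clusters is farther apart than `cutoff`."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


docs = [
    "The cat sat on the mat.",
    "Cats and kittens are sleeping on mats.",
    "Stock markets rallied today.",
    "Investors watched the stock market.",
]
vectors = tfidf_vectors([preprocess(d) for d in docs])
clusters = hierarchical_clusters(vectors, cutoff=0.9)  # hand-picked cut-off
print(clusters)
```

The cut-off distance decides where the merging stops and therefore how many clusters result, which is why the abstract treats it as the quantity to optimize. In this sketch a lower `cutoff` yields more, smaller clusters, and a higher one yields fewer, larger clusters.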
INTRODUCTION
Humans generally read speech and text data with ease, but machine learning and statistical modeling applications must handle such data in unstructured form, so the coded input feature sets need some alteration. 1 Data clustering is a technique for splitting data elements into groups so that elements in the same group have the highest similarity, while elements in different groups differ with respect to the clusters' attributes. The major aim of clustering techniques is to obtain centroids, or cluster centers, that characterize each cluster. Clustering techniques have been classified from different perspectives into "density-based methods, grid-based methods, partitioning methods, and hierarchical methods." 2,3 Moreover, a data set is defined as categorical or numerical. The primary statistical features of numeric data are used to describe the distance function between data elements. Categorical data are derived from qualitative and quantitative data, and their descriptions are obtained from counts. 4 With a "textual virtual schematic model" (TVSM), textual data are assigned to clusters in three steps. First, unstructured data are extracted from the data source and converted into structured data. 5 Next, clustering is applied to the structured data. Finally, documents are compared to improve query performance in terms of accuracy.
Day-to-day life generates a huge amount of unstructured text...