Initialization-similarity clustering algorithm

Liu, Tong; Zhu, Jingting; Zhou, Jukai; Zhu, Yongxin; Zhu, Xiaofeng

doi:10.1007/s11042-019-7663-8

Cited by 9 publications

(4 citation statements)

References 52 publications

“…The Dual-index tags were used to classify the raw sequencing data by sample. Sequences read from the same locus were grouped using similarity clustering [51]. In general, only high-depth fragments were selected in each cluster group; low-depth segments were removed.…”

Section: Development of SLAF Tags and SNP Markers

mentioning

confidence: 99%

Investigation of Genetic Relationships within Miscanthus Using SNP Markers Identified Using SLAF-seq

Chen, Huang, Yi et al. 2020

SSRN Journal

Background: Miscanthus, which is a leading dedicated-energy grass in Europe and in parts of Asia, is expected to play a key role in the development of the future bioeconomy. However, due to its complex genetic background, it is difficult to investigate phylogenetic relationships and the evolution of gene function in this genus. Here, we investigated 50 Miscanthus germplasms: 1 female parent (M. lutarioriparius), 30 candidate male parents (M. lutarioriparius, M. sinensis, and M. sacchariflorus), and 19 offspring. We used high-throughput Specific-Locus Amplified Fragment sequencing (SLAF-seq) to identify informative single nucleotide polymorphisms (SNPs) in all germplasms.

Results: We identified 800,081 SLAF tags, of which 160,368 were polymorphic. Each tag was 264-364 bp long. The obtained SNPs were used to investigate genetic relationships within Miscanthus. We constructed a phylogenetic tree of the 50 germplasms using the obtained SNPs, and found that the germplasms fell into two clades: one clade of M. sinensis only and one clade that included the offspring, M. lutarioriparius, and M. sacchariflorus. Genetic cluster analysis indicated that M. lutarioriparius germplasm C3 was the most likely male parent of the offspring.

Conclusions: As a high-throughput sequencing method, SLAF-seq can be used to identify informative SNPs in Miscanthus germplasms and to rapidly characterize genetic relationships within this genus. Our results will support the development of breeding programs utilizing Miscanthus cultivars with elite biomass- or fiber-production potential.

“…The execution time of the deterministic and incremental approaches has increased exponentially on multidimensional data [43]. The authors of References 19 and 44 demonstrated that nonrandomization and stable convergence produce excellent initial centroids. Nonrandomization and stable convergence are used to improve clustering performance and cluster initialization issues such as local optima, iterations, convergence speed, and so on.…”

Section: Introduction

mentioning

confidence: 99%

“…Clustering is the most exploratory task in big data mining and is used in numerous domains such as character and pattern recognition [7], image segmentation and processing [8,9], text analysis [10], video processing [11], social network analysis [12], bioinformatics [13,14], recommendation task [15], wireless sensor [16], document clustering [17,18], molecular biology [7], pattern recognition [7], psychology [19], medicine [19], gene expression grouping [20], business analysis [21], software evolution [21], educational data analysis [21], data reduction and compression [7,21], climatology [21], sequence analysis [21], field robotics [21], and so forth. The unstructured data types are converted into feature vectors according to clustering algorithm for big data clustering.…”

mentioning

confidence: 99%

Min‐max kurtosis stratum mean: An improved K‐means cluster initialization approach for microarray gene clustering on multidimensional big data

Pandey, Shukla 2022

Concurrency and Computation

SUMMARY Microarray gene clustering is a big data application that employs the K‐means (KM) clustering algorithm to identify hidden patterns, evolutionary relationships, unknown functions and gene trends for disease diagnosis, tissue detection and biological analysis. The selection of initial centroids is a major issue in the KM algorithm because it influences the effectiveness, efficiency and local optima of the cluster. The existing initial centroid initialization algorithm is computationally expensive and degrades cluster quality due to the large dimensionality and interconnectedness of microarray gene data. To deal with this issue, this study proposed the min‐max kurtosis stratum mean (MKSM) algorithm for big data clustering in a single machine environment. The MKSM algorithm uses kurtosis for dimension selection, mean distance for gene relationship identification, and stratification for heterogeneous centroid extraction. The results of the presented algorithm are compared to the state‐of‐the‐art initialization strategy on twelve microarray gene datasets utilizing internal, external and statistical assessment criteria. The experimental results demonstrate that the MKSMKM algorithm reduces iterations, distance computation, data comparison and local optima, and improves cluster performance, effectiveness and efficiency with stable convergence.
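The summary above names kurtosis-based dimension selection and stratification as ingredients of MKSM but does not give the procedure. The following is a minimal sketch of that general idea only, not the authors' algorithm: it picks one feature by excess kurtosis, sorts the data along it, splits the sorted order into K strata, and uses each stratum's mean as an initial centroid. The function names, the `use_max` switch, and the omission of the mean-distance step are all assumptions made for illustration.

```python
import numpy as np


def excess_kurtosis(column):
    """Population excess kurtosis of a 1-D array (0.0 for a constant column)."""
    centered = column - column.mean()
    variance = (centered ** 2).mean()
    if variance == 0:
        return 0.0
    return (centered ** 4).mean() / variance ** 2 - 3.0


def stratum_mean_centroids(X, k, use_max=True):
    """Hypothetical kurtosis-plus-stratification initializer (illustrative)."""
    X = np.asarray(X, dtype=float)
    if not 1 <= k <= len(X):
        raise ValueError("k must be between 1 and the number of rows")
    # Assumption: a single feature with extreme kurtosis drives the ordering.
    kurt = np.array([excess_kurtosis(X[:, j]) for j in range(X.shape[1])])
    dim = int(kurt.argmax() if use_max else kurt.argmin())
    # Sort rows along the chosen feature and cut the order into k strata.
    order = np.argsort(X[:, dim], kind="stable")
    strata = np.array_split(order, k)
    # Each initial centroid is the mean of the rows in one stratum.
    return np.vstack([X[idx].mean(axis=0) for idx in strata])
```

Because nothing here is random, repeated calls on the same data return the same centroids, which is the property the summary associates with stable convergence. The result could be passed to any K-means implementation that accepts explicit initial centroids.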

“…Clustering is used for segmenting or grouping data into clusters based on similarities and dissimilarities. Clustering is a multivariate statistical technique that achieves maximized within-cluster similarity and between-cluster dissimilarity based on similarity, dissimilarity and distance measures according to the nature of the data (Liu et al., 2019).…”

Section: Introduction

mentioning

confidence: 99%

NDPD: an improved initial centroid method of partitional clustering for big data mining

Pandey, Shukla 2022

JAMR

Purpose: The K-means (KM) clustering algorithm is extremely responsive to the selection of initial centroids since the initial centroid of clusters determines computational effectiveness, efficiency and local optima issues. Numerous initialization strategies attempt to overcome these problems through the random and deterministic selection of initial centroids. The random initialization strategy suffers from local optimization issues with the worst clustering performance, while the deterministic initialization strategy incurs high computational cost. Big data clustering aims to reduce computation costs and improve cluster efficiency. The objective of this study is to achieve a better initial centroid for big data clustering on business management data without using random and deterministic initialization that avoids local optima and improves clustering efficiency with effectiveness in terms of cluster quality, computation cost, data comparisons and iterations on a single machine.

Design/methodology/approach: This study presents the Normal Distribution Probability Density (NDPD) algorithm for big data clustering on a single machine to solve business management-related clustering issues. The NDPDKM algorithm resolves the KM clustering problem by probability density of each data point. The NDPDKM algorithm first identifies the most probable density data points by using the mean and standard deviation of the datasets through normal probability density. Thereafter, the NDPDKM determines K initial centroids by using sorting and linear systematic sampling heuristics.

Findings: The performance of the proposed algorithm is compared with KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms through Davies Bouldin score, Silhouette coefficient, SD Validity, S_Dbw Validity, Number of Iterations and CPU time validation indices on eight real business datasets. The experimental evaluation demonstrates that the NDPDKM algorithm reduces iterations, local optima and computing costs, and improves cluster performance, effectiveness and efficiency with stable convergence as compared to other algorithms. The NDPDKM algorithm reduces the average computing time by up to 34.83%, 90.28%, 71.83%, 92.67%, 69.53% and 76.03%, and reduces the average iterations by up to 40.32%, 44.06%, 32.02%, 62.78%, 19.07% and 36.74% with reference to KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms.

Originality/value: The KM algorithm is the most widely used partitional clustering approach in data mining techniques that extract hidden knowledge, patterns and trends for decision-making strategies in business data. Business analytics is one of the applications of big data clustering where KM clustering is useful for the various subcategories of business analytics such as customer segmentation analysis, employee salary and performance analysis, document searching, delivery optimization, discount and offer analysis, chaplain management, manufacturing analysis, productivity analysis, specialized employee and investor searching and other decision-making strategies in business.
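The methodology described in the abstract above (score each point by normal probability density using the dataset mean and standard deviation, then sort and apply linear systematic sampling to obtain K centroids) can be sketched roughly as follows. This is an illustrative reading of the abstract, not the published NDPD algorithm: the per-feature independence assumption, the use of log-density, the sampling offset, and the name `ndpd_style_centroids` are all assumptions.

```python
import numpy as np


def ndpd_style_centroids(X, k):
    """Hypothetical density-sort-sample initializer (illustrative only)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of rows")
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    # Assumption: features are treated as independent normals, so the
    # log-density of a row is the sum of its per-feature log-densities.
    z = (X - mu) / sigma
    log_density = (-0.5 * z ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))).sum(axis=1)
    # Sort rows from least to most probable under that model.
    order = np.argsort(log_density, kind="stable")
    # Linear systematic sampling: a fixed step through the sorted order,
    # starting from the middle of the first interval (an assumed offset).
    step = n // k
    picks = order[step // 2::step][:k]
    return X[picks]
```

The sketch is deterministic, so it returns the same K rows on every call for a given dataset; the returned rows could then seed a standard K-means run in place of random initialization.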
