2019
DOI: 10.1007/s10618-019-00635-1

Efficient mixture model for clustering of sparse high dimensional binary data

Abstract: Clustering is one of the fundamental tools for preliminary analysis of data. While most of the clustering methods are designed for continuous data, sparse high-dimensional binary representations became very popular in various domains such as text mining or cheminformatics. The application of classical clustering tools to this type of data usually proves to be very inefficient, both in terms of computational complexity as well as in terms of the utility of the results. In this paper we propose a mixture model, …

citations
Cited by 11 publications
(7 citation statements)
references
References 70 publications
“…This can occur in pangenomics as the discovery rate of new families in the pangenome slightly decreases when new genomes are added. Mathematical solutions to this problem seem to exist [50][51][52] for example via the weighting of genomes (based on their respective contribution to the pangenome diversity) or via sparse partitioning methods. An improvement of NEM should include these solutions and could be a perspective of this work.…”
Section: Issues Resulting From High-dimensional Statistics and Parall
Citation type: mentioning
confidence: 99%
“…Actually, it can be the case in pangenomics as the number of new families added to the pangenome slightly decreases when new genomes are added (see figure 3 in [1]). Mathematical solutions to this issue seem to exist [46,47,48] for example via the weighting of features, corresponding to the weighting of genomes in our case. An improved version of NEM should include this improvement and could be perspective of this work.…”
Section: Issues Resulting From High-dimensional Statistics
Citation type: mentioning
confidence: 97%
“…Generally, clustering can be divided into five categories: partitioning [10, 11], hierarchical [12, 13], model-based [14, 15], density-based [16, 17, 18], and grid-based algorithms. Partitioned clustering is designed to discover clusters in the data by optimizing a given objective function.…”
Section: Introduction
Citation type: mentioning
confidence: 99%