2011
DOI: 10.1007/978-3-642-25324-9_31

Instance Selection in Text Classification Using the Silhouette Coefficient Measure

Abstract: The paper proposes the use of the Silhouette Coefficient (SC) as a ranking measure to perform instance selection in text classification. Our selection criterion was to keep instances with mid-range SC values while removing the instances with high and low SC values. We evaluated our hypothesis across three well-known datasets and various machine learning algorithms. The results show that our method helps to achieve the best trade-off between classification accuracy and training time.
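
The selection criterion described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes class labels play the role of clusters, uses Euclidean distance over dense feature vectors, and trims a fixed fraction of instances from each end of the SC ranking. The function names and the `low_frac` / `high_frac` parameters are hypothetical; the abstract does not state the thresholds actually used.

```python
import math
from collections import defaultdict


def silhouette_values(vectors, labels):
    """Per-instance silhouette coefficient, treating class labels as clusters.

    Assumption: Euclidean distance; the paper may use a different measure.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    scores = []
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        own = [j for j in by_label[label] if j != i]
        if not own or len(by_label) < 2:
            # Common convention: SC is 0 for a singleton class
            # or when there is no other class to compare against.
            scores.append(0.0)
            continue
        # a: mean distance to the other instances of the same class
        a = sum(math.dist(vec, vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the instances of any other class
        b = min(
            sum(math.dist(vec, vectors[j]) for j in members) / len(members)
            for other, members in by_label.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def select_mid_range(vectors, labels, low_frac=0.2, high_frac=0.2):
    """Return sorted indices of instances whose SC rank falls in the middle.

    low_frac / high_frac are hypothetical trimming fractions for the
    lowest-SC and highest-SC instances respectively.
    """
    scores = silhouette_values(vectors, labels)
    order = sorted(range(len(scores)), key=scores.__getitem__)
    n = len(order)
    drop_low = int(n * low_frac)
    drop_high = int(n * high_frac)
    return sorted(order[drop_low:n - drop_high])
```

The returned indices would then be used to subset the training set before fitting a classifier. Ranking by fraction rather than by fixed SC cut-offs is a design choice made here for simplicity, since SC distributions differ between datasets.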

Cited by 11 publications (4 citation statements)
References 16 publications
“…I(x,y) represents the mutual information between x and y, H(x) and H(y) are the entropy of x and y. NMI is defined as shown in Eq. (14): SC is another evaluation index of clustering results, originally proposed by Peter J. Rousseeuw in 1986 [37]. It combines the two factors of intra cluster and inter-cluster, which can be calculated as shown in Eqs.…”
Section: Tcia (mentioning)
confidence: 99%
“…This preprocessing type is known as instance selection. The silhouette coefficient (Dey et al 2011) was used as the criterion for detecting potentially noisy signals:…”
Section: Instance Selection (mentioning)
confidence: 99%
“…This value is helpful in denoting the cohesiveness of the data in one cluster and the separation of data in one cluster from those in the other clusters. This coefficient has been used in text classification not only to analyze the quality of the clustering but also as a feature selection technique [Dey et al, 2011]. In clustering tasks, the SC is calculated for each of the documents in the clusters in order to evaluate the clustering solution.…”
Section: Weighting Scheme (mentioning)
confidence: 99%
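
For reference, the per-instance silhouette coefficient that these statements describe is usually written as below. This is the standard textbook formulation, given here as an aid to the reader; the citing papers' own equations are not reproduced on this page and may use different notation. Here $a(i)$ is the mean distance from instance $i$ to the other members of its own cluster $C_i$ (cohesion), and $b(i)$ is the smallest mean distance from $i$ to the members of any other cluster (separation).

```latex
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j),
\qquad
b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j),
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \quad -1 \le s(i) \le 1 .
```

Values of $s(i)$ near 1 indicate an instance well inside its own cluster, values near 0 an instance on a boundary, and negative values an instance that lies closer to another cluster.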