2015 7th Computer Science and Electronic Engineering Conference (CEEC), 2015
DOI: 10.1109/ceec.2015.7332690

A new term weighting scheme based on class specific document frequency for document representation and classification

Abstract: Document classification is usually more challenging than numerical data classification, because it is much more difficult to effectively represent documents than numerical data for classification purposes. Vector space model (VSM) has been widely used for document representation for classification, in which a document is represented by a vector of feature values based on a bag of words. This paper proposes a new feature for document representation under the VSM framework, class specific document frequency (CSD…

Cited by 7 publications (3 citation statements)
References 22 publications
“…After pre-processing, the total number of words for the emails_v1 dataset, emails_v2 dataset, 20 newsgroups dataset, and routers dataset was 1,014, 465, 2,591, and 412 respectively. After that, four term-weighting schemes were applied to the words in the BOW: term frequency (TF), term presence (TP), term frequency and inverse document frequency (TF-IDF), [36] and term presence and class-specific document frequency (TP-CSDF), [37] to generate numerical features. The other three datasets are numerical.…”
Section: Experiments Procedures (mentioning)
confidence: 99%
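The first three weighting schemes named in the statement above (TF, TP, and TF-IDF) can be sketched on a toy bag of words as follows. This is a minimal illustration, not the citing paper's code: the function names, the three toy documents, and the plain `log(N / df)` form of IDF (one common variant among several) are all assumptions made here. TP-CSDF is left out because its formula is not given in the quoted text.

```python
import math
from collections import Counter


def term_frequency(doc_tokens, vocab):
    """TF: raw count of each vocabulary term in one document."""
    counts = Counter(doc_tokens)
    return [counts[t] for t in vocab]


def term_presence(doc_tokens, vocab):
    """TP: 1 if the term occurs in the document, else 0."""
    present = set(doc_tokens)
    return [1 if t in present else 0 for t in vocab]


def tf_idf(docs_tokens, vocab):
    """TF-IDF with idf = log(N / df); this IDF variant is an assumption."""
    n_docs = len(docs_tokens)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for d in docs_tokens if t in d) for t in vocab}
    idf = {t: math.log(n_docs / df[t]) if df[t] else 0.0 for t in vocab}
    return [
        [tf * idf[t] for tf, t in zip(term_frequency(d, vocab), vocab)]
        for d in docs_tokens
    ]


# Hypothetical toy corpus, already tokenised.
docs = [
    "cheap pills buy cheap".split(),
    "meeting agenda attached".split(),
    "buy agenda".split(),
]
vocab = sorted({t for d in docs for t in d})

print(vocab)
print(term_frequency(docs[0], vocab))
print(term_presence(docs[0], vocab))
print(tf_idf(docs, vocab)[0])
```

TF and TP differ only for terms that repeat within a document ("cheap" in the first toy document), while TF-IDF additionally discounts terms that appear in many documents.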
“…[14] In this paper, the effectiveness of CSDF will be further investigated for web document representation for classification purposes. The basic idea of CSDF is that a term in a document is very important for classifying documents if it is more frequent inside the document and other documents belonging to the same class as well but less frequent in documents belonging to different classes.…”
Section: CSDF for Web Document Representation (mentioning)
confidence: 99%
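The intuition in the statement above can be illustrated with a toy weight that grows with a term's frequency in the document and with the share of same-class documents containing it, and shrinks with the share of other-class documents containing it. The formula below is an assumption chosen only to show that behaviour; it is not the CSDF definition from the paper, which the quoted text does not give. The function names, labels, and toy documents are likewise hypothetical, and the document's class label is assumed to be known (as for training data).

```python
def class_doc_rate(docs, labels, term, cls, inside=True):
    """Share of documents inside (or outside) class `cls` that contain `term`."""
    selected = [d for d, y in zip(docs, labels) if (y == cls) == inside]
    if not selected:
        return 0.0
    return sum(1 for d in selected if term in d) / len(selected)


def illustrative_weight(doc, docs, labels, term, cls):
    """Toy class-aware weight (NOT the paper's CSDF formula).

    Higher when the term is frequent in `doc` and common in class `cls`,
    lower when the term is also common in the other classes.
    """
    tf = doc.count(term)
    in_rate = class_doc_rate(docs, labels, term, cls, inside=True)
    out_rate = class_doc_rate(docs, labels, term, cls, inside=False)
    return tf * in_rate / (1.0 + out_rate)


# Hypothetical labelled toy corpus, already tokenised.
docs = [
    "cheap pills buy cheap".split(),
    "buy cheap watches".split(),
    "meeting agenda attached".split(),
    "buy agenda".split(),
]
labels = ["spam", "spam", "ham", "ham"]

w_cheap = illustrative_weight(docs[0], docs, labels, "cheap", "spam")
w_buy = illustrative_weight(docs[0], docs, labels, "buy", "spam")
print(w_cheap, w_buy)
```

In this toy corpus "cheap" appears only in the spam documents, whereas "buy" also appears in a ham document, so "cheap" receives the larger weight for the first document.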
“…For document representation, the class specific document frequency (CSDF) weighting method is adopted, which has been demonstrated to effectively improve the performance of document classification in comparison with other widely used vector space model (VSM) based document representations. [14] This paper proposes a new ranking method called GCrank that combines the original Google ranking scores and the LDA classification scores of the Google search returned web documents to improve ranking performance, which is demonstrated by experimental results in terms of several widely used ranking performance criteria.…”
Section: Introduction (mentioning)
confidence: 99%