An improved term weighting scheme for text classification

Zhong, Tang; Li, Wenqiang; Li, Yan

doi:10.1002/cpe.5604

Cited by 19 publications

(16 citation statements)

References 65 publications

“…Additionally, feature selection is the recommended approach by practitioners and researchers [48]. In contrast, previous work also indicates that the performance tends to increase if more features are used [47]. It would be interesting to investigate if feature selection can improve the accuracy for small datasets using our experimental approach [37].…”

Section: Discussion (mentioning)

confidence: 99%

Simple Baseline Machine Learning Text Classifiers for Small Datasets

Riekert

Klein

2021

SN COMPUT. SCI.

Text classification is important to better understand online media. A major problem for creating accurate text classifiers using machine learning is small training sets due to the cost of annotating them. On this basis, we investigated how SVM and NBSVM text classifiers should be designed to achieve high accuracy and how the training sets should be sized to efficiently use annotation labor. We used a four-way repeated-measures full-factorial design of 32 design factor combinations. For each design factor combination 22 training set sizes were examined. These training sets were subsets of seven public text datasets. We study the statistical variance of accuracy estimates by randomly drawing new training sets, resulting in accuracy estimates for 98,560 different experimental runs. Our major contribution is a set of empirically evaluated guidelines for creating online media text classifiers using small training sets. We recommend uni- and bi-gram features as text representation, btc term weighting and a linear-kernel NBSVM. Our results suggest that high classification accuracy can be achieved using a manually annotated dataset of only 300 examples.

“…In other aspects, Zhong Tang et al. described two deficiencies from which TF-IDF suffers, namely, the collection frequency factor being undefined (division by zero) or being equal to zero in some special cases. They proposed a novel method, namely, term frequency-inverse exponential frequency (TF-IEF), to overcome these drawbacks [14]. The proposed methods replaced the IDF with a global weighting factor IEF, and a log-like method is used to characterize the collection frequency factor.…”

Section: Background Study (mentioning)

confidence: 99%
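
The two TF-IDF deficiencies named in the statement above can be reproduced with a small sketch. It assumes the classic form IDF = log(N / df) with no smoothing; the toy corpus and the helper names `idf` and `doc_freq` are illustrative and not taken from the cited papers. It does not implement TF-IEF, whose formula is not given here.

```python
import math

def idf(n_docs: int, df: int) -> float:
    """Unsmoothed IDF, log(N / df). Assumed form, for illustration only."""
    return math.log(n_docs / df)

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["price", "market", "the"],
    ["market", "the"],
    ["goal", "the"],
]
n_docs = len(docs)

def doc_freq(term: str) -> int:
    """Number of documents that contain the term."""
    return sum(term in d for d in docs)

# Case 1: a term present in every document gets a weight of exactly zero.
idf_the = idf(n_docs, doc_freq("the"))    # log(3 / 3)

# A term present in a single document gets a positive weight.
idf_goal = idf(n_docs, doc_freq("goal"))  # log(3 / 1)

# Case 2: a term absent from the collection (df = 0) makes IDF undefined.
try:
    idf(n_docs, doc_freq("unseen"))
    unseen_defined = True
except ZeroDivisionError:
    unseen_defined = False
```

A term such as "the" is zeroed out regardless of how often it occurs in a document, and a term with a document frequency of zero cannot be weighted at all. Library implementations commonly add smoothing to avoid both cases.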

“…"Good" term weighting methods are of fundamental importance for guaranteeing good TC performance. So far, there are two main categories of TWSs in the literature: semantic-based TWSs and statistics-based TWSs [14].…”

Section: Introduction (mentioning)

confidence: 99%

“…Some popular examples are shown in Table 1, where values of "NONE" indicate there is no corresponding method for the specific parameter. As the table shows, some methods focus on modifying the term frequency factor (i.e., LogTF-RF [11] and SQRT_TF-IGM [25]), while some focus on developing novel methods as the collection frequency factor (i.e., TF-IDF [26], TF-CHI2 [27], TF-IEF [14], and TF-IGM [25]). Nevertheless, TF-IDF is still one of the most preferred methods.…”

Section: Introduction (mentioning)

confidence: 99%

Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports

Jiang

et al. 2021

Mathematical Problems in Engineering

With the rapid development of the internet technology, a large amount of internet text data can be obtained. The text classification (TC) technology plays a very important role in processing massive text data, but the accuracy of classification is directly affected by the performance of term weighting in TC. Due to the original design of information retrieval (IR), term frequency-inverse document frequency (TF-IDF) is not effective enough for TC, especially for processing text data with unbalanced distributions in internet media reports. Therefore, the variance between the DF value of a particular term and the average of all DFs, $\overline{DF}$, namely, the document frequency variance (ADF), is proposed to enhance the ability in processing text data with unbalanced distribution. Then, the normal TF-IDF is modified by the proposed ADF for processing unbalanced text collection in four different ways, namely, TF-IADF, TF-IADF+, TF-IADFnorm, and TF-IADF+norm. As a result, an effective model can be established for the TC task of internet media reports. A series of simulations have been carried out to evaluate the performance of the proposed methods. Compared with TF-IDF on state-of-the-art classification algorithms, the effectiveness and feasibility of the proposed methods are confirmed by simulation results.

“…Text representation is a necessary and primary procedure in performing ATC and OM systems. It first needs to be obtained through an information‐rich term weighting scheme to achieve higher performance [12]. Most known techniques for text representation form vectors that contain many zeros as most terms appear in a small number of texts.…”

Section: Introduction (mentioning)

confidence: 99%

Supervised classification by thresholds: Application to automated text categorization and opinion mining

Cherif

Madani

Kissi

2021

Concurrency and Computation

Over recent years, the world has experienced explosive growth in the volume of textual data, which makes a manual analysis impossible. Machine learning techniques provided an effective solution to this problem. Due to its capacity to organize the huge and varied amounts of data, it offered valuable insights and it has become an emerging investigative field for the research community. Classification techniques are used to classify data into different classes according to desired criteria. By their simplicity, they give rise to a variety of applications: automated text categorization, opinion mining, and so forth. These processes go through three stages: text representation, features extraction, and the classification process; they still face many difficulties due both to the complex nature of text databases and to the high dimensionality of texts representations. This article presents a new classification approach that learns to classify texts from the most reliable features more accurately. The added advantage of the proposed approach is that it automatically classifies a text without necessarily processing all its features. The experimental results showed that this new classification by thresholds outperforms the state‐of‐the‐art methods. As a result, the obtained f‐measure on automatic text categorization was 95.06% while it is lower on opinion mining.
