Least information document representation for automated text classification

Wu, Ke

doi:10.1002/meet.14504901118

Cited by 5 publications

(5 citation statements)

References 19 publications

“…Text Rank was also developed around the same time as graph-based feature extraction. In the same vein as LexRank, TextRank is based on PageRank and uses pointless graphs, however, unlike LexRank, TextRank is limited to single documents and is a general, unsupervised extractive feature extraction system as part of the initial tokenization, the text is annotated with parts-of-speech tags, but individual words are only considered as possible additions to the graph [7]. Using this proposed method, a text unit will recommend other text units that fit within the same theme.…”

Section: Related Work (mentioning)

confidence: 99%

A Novel Hierarchical Document Clustering Framework on Large TREC Biomedical Documents

Kumari, Jeeva, Satyanarayana

2022

IJITCS

The growth of microblogging sites such as Biomedical, biomedical, defect, or bug databases makes it difficult for web users to share and express their context identification of sequential key phrases and their categories on text clustering applications. In the traditional document classification and clustering models, the features associated with TREC texts are more complex to analyze. Finding relevant feature-based key phrase patterns in the large collection of unstructured documents is becoming increasingly difficult, as the repository's size increases. The purpose of this study is to develop and implement a new hierarchical document clustering framework on a large TREC data repository. A document feature selection and clustered model are used to identify and extract MeSH related documents from TREC biomedical clinical benchmark datasets. Efficiencies of the proposed model are indicated in terms of computational memory, accuracy, and error rate, as demonstrated by experimental results.

Section: Related Work (mentioning)

confidence: 99%

“…First, two widely used data sets were used as the benchmark, as follows: The Reuters 21578 Distribution 1.0 data set (Reuters) consists of 12,902 articles and 90 topic categories from the Reuters newswire (Aphinyanaphongs et al, 2014; Debole & Sebastiani, 2003; Gliozzo et al, 2005; Ke, 2012; Sun, Lim, & Ng, 2003; Sun, Lim, & Liu, 2009; Yang & Liu, 1999; Yu et al, 2003). Following other studies by Nigam (2001) and Joachims (1998), we built binary classifiers for each class to identify the news topic.…”

Section: Test Collection and Experimental Settings (mentioning)

confidence: 99%

“…The Newsgroups data set (NG), collected by Lang, contains about 20,000 documents that are evenly divided among 20 UseNet discussion groups (Banerjee & Basu, 2007; Gliozzo et al, 2005; Ke, 2012; McCallum & Nigam, 1998; Sun et al, 2009; Yoon, Lee, & Lee, 2006). For a fair evaluation, we evaluated our scheme using the fivefold cross-validation method.…”

Section: Test Collection and Experimental Settings (mentioning)

confidence: 99%

A new term‐weighting scheme for text classification using the odds of positive and negative class probabilities

2015

Journal of the Association for Information Science and Technology

The peculiarity of text classification that differs most from information retrieval is the existence of class information. Therefore, this paper proposes a new term weighting scheme that utilizes class information using positive and negative class distributions. As a result, the proposed scheme, log tf.TRR, consistently performs better than other schemes using class information, as well as traditional schemes such as tf.idf.

“…For example, Zhu et al [5,6] compiled a series of benchmarks for CTR tasks to help engineers determine whether certain methods are competitive in business. Weimao et al [7] sorted out multiple benchmarks of text classification tasks. They conducted classification experiments on this basis, so as to compare and analyze the advantages and disadvantages of various methods.…”

Section: Introduction (mentioning)

confidence: 99%

When to Use Large Language Model: Upper Bound Analysis of BM25 Algorithms in Reading Comprehension Task

Liu, Xiong, Zhang

2023

Preprint

Large language model (LLM) is a representation of a major advancement in AI, and has been used in multiple natural language processing tasks. Nevertheless, in different business scenarios, LLM requires fine-tuning by engineers to achieve satisfactory performance, and the cost of achieving target performance and fine-tuning may not match. Based on the Baidu STI dataset, we study the upper bound of the performance that classical information retrieval methods can achieve under a specific business, and compare it with the cost and performance of the participating team based on LLM. This paper gives an insight into the potential of classical computational linguistics algorithms, and which can help decision-makers make reasonable choices for LLM and low-cost methods in business R&D.
