2012
DOI: 10.1002/meet.14504901118

Least information document representation for automated text classification

Abstract: We propose the Least Information theory (LIT) to quantify meaning of information in probability distributions and derive a new document representation model for text classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, LIT offers an information-centric approach to weight terms based on probability distributions in documents vs. in the collection. We develop two term weight quantities in the document classification context: 1) LI Binary (LIB) whi…
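To make the abstract's idea concrete, here is a minimal Python sketch of term weighting driven by the contrast between a term's probability in a single document and in the whole collection. It is only an illustration of that general idea: the pointwise KL-style formula below is an assumption for demonstration, not the paper's LIB quantity, which the truncated abstract does not fully specify.

import math
from collections import Counter

def term_weights(doc_tokens, collection_tokens):
    """Weight each document term by how far its in-document probability
    departs from its collection-wide probability (pointwise KL-style)."""
    doc_counts, col_counts = Counter(doc_tokens), Counter(collection_tokens)
    n_doc, n_col = len(doc_tokens), len(collection_tokens)
    weights = {}
    for term, count in doc_counts.items():
        p_doc = count / n_doc
        p_col = col_counts[term] / n_col  # assumes the collection contains the document
        weights[term] = p_doc * math.log(p_doc / p_col)
    return weights

doc = "entropy information entropy text".split()
collection = doc + ("the cat sat on the mat " * 5).split() + "information text classification".split()
print(term_weights(doc, collection))  # terms frequent in the doc but rare in the collection score highest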

Cited by 5 publications (5 citation statements), published 2015–2023. References 19 publications.
“…TextRank was also developed around the same time as graph-based feature extraction. In the same vein as LexRank, TextRank is based on PageRank and uses pointless graphs; however, unlike LexRank, TextRank is limited to single documents and is a general, unsupervised extractive feature extraction system. As part of the initial tokenization, the text is annotated with part-of-speech tags, but only individual words are considered as possible additions to the graph [7]. Using this proposed method, a text unit will recommend other text units that fit within the same theme.…”
Section: Related Work
confidence: 99%
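As a rough illustration of the TextRank procedure described in this citation, the sketch below builds a word co-occurrence graph and scores the nodes with PageRank via networkx. The sliding-window size and the crude stop-word filter standing in for real part-of-speech tagging are simplifying assumptions, not the original algorithm's exact setup.

import networkx as nx

STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "on", "for", "by", "such", "as", "from"}

def textrank_keywords(tokens, window=2, top_k=5):
    # Keep candidate words only (stop-word filter stands in for POS tagging here).
    candidates = [t.lower() for t in tokens if t.isalpha() and t.lower() not in STOPWORDS]
    graph = nx.Graph()
    graph.add_nodes_from(candidates)
    # Connect words that co-occur within the sliding window.
    for i, word in enumerate(candidates):
        for other in candidates[i + 1:i + window + 1]:
            if other != word:
                graph.add_edge(word, other)
    scores = nx.pagerank(graph)  # PageRank over the undirected word graph
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

text = "Graph based ranking models such as TextRank score text units by recommendation from related text units".split()
print(textrank_keywords(text))  # highest-ranked candidate keywords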
“…First, two widely used data sets were used as the benchmark, as follows: The Reuters 21578 Distribution 1.0 data set (Reuters) consists of 12,902 articles and 90 topic categories from the Reuters newswire (Aphinyanaphongs et al., 2014; Debole & Sebastiani, 2003; Gliozzo et al., 2005; Ke, 2012; Sun, Lim, & Ng, 2003; Sun, Lim, & Liu, 2009; Yang & Liu, 1999; Yu et al., 2003). Following other studies by Nigam (2001) and Joachims (1998), we built binary classifiers for each class to identify the news topic.…”
Section: Test Collection and Experimental Settings
confidence: 99%
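The "one binary classifier per class" setup mentioned in this citation can be sketched with scikit-learn's OneVsRestClassifier. The toy in-memory documents, the TF-IDF features, and the linear SVM are illustrative assumptions; the cited experiments use the full Reuters-21578 collection and their own term-weighting schemes.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["oil prices rise on supply fears",
        "central bank raises interest rates",
        "crude output cut lifts oil futures",
        "rate hike cools lending growth"]
labels = ["crude", "interest", "crude", "interest"]

# One binary classifier is fit per topic label over TF-IDF features.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
clf.fit(docs, labels)
print(clf.predict(["oil supply tightens"]))  # likely ['crude'] on this toy data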
“…The Newsgroups data set (NG), collected by Lang, contains about 20,000 documents that are evenly divided among 20 UseNet discussion groups (Banerjee & Basu, 2007; Gliozzo et al., 2005; Ke, 2012; McCallum & Nigam, 1998; Sun et al., 2009; Yoon, Lee, & Lee, 2006). For a fair evaluation, we evaluated our scheme using the fivefold cross-validation method.…”
Section: Test Collection and Experimental Settings
confidence: 99%
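Below is a minimal sketch of fivefold cross-validation on the Newsgroups data mentioned above, using scikit-learn. Restricting the run to two newsgroups and using a plain TF-IDF plus Naive Bayes pipeline are assumptions to keep the example small; the cited evaluation covers all 20 groups with the scheme under study.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Two categories only, with headers/footers/quotes stripped, to keep the demo quick.
data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(model, data.data, data.target, cv=5)  # five folds
print(scores.mean())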
“…For example, Zhu et al. [5,6] compiled a series of benchmarks for CTR tasks to help engineers determine whether certain methods are competitive in business. Ke [7] assembled multiple benchmarks for text classification tasks and conducted classification experiments on them to compare and analyze the advantages and disadvantages of various methods.…”
Section: Introduction
confidence: 99%