Statistical Identification of Key Phrases for Text Classification

Coenen, Frans; Leng, Paul; Sanderson, Robert; Wang, Yanbo J.

doi:10.1007/978-3-540-73499-4_63

Cited by 15 publications

(18 citation statements)

References 8 publications

“…Accuracy figures, describing the proportion of correctly classified "unseen" documents, were obtained using the Ten-fold Cross Validation (TCV). A support threshold value of 0.1% and a Lower Noise Threshold (LNT) value of 0.2% were used, as suggested in [6]. A confidence threshold value of 50% was used (as proposed in the published evaluations of a number of associative classification studies [5,15,28]).…”

Section: Results (mentioning)

confidence: 99%

“…the first k words for each predefined class) that are selected from the ordered list of potential significant words (in a descending manner based on their contribution value) are defined to be significant words. In [6] the authors (based on the above definitions) propose a statistical "bag of phrases" (DR) approach for TC, namely DelSNcontGO: phrases are Delimited by stop marks (S) and/or noise words (N), and (as phrase contents) made up of sequences of one or more significant words (G) and ordinary words (O); sequences of ordinary words delimited by stop marks and/or noise words that do not include at least one significant word (in the contents) are ignored. The experimental results presented in [6] show that DelSNcontGO performs well with respect to the accuracy of classification.…”

Section: Significant Words (G) (mentioning)

confidence: 99%

“…In [6] the authors (based on the above definitions) propose a statistical "bag of phrases" (DR) approach for TC, namely DelSNcontGO: phrases are Delimited by stop marks (S) and/or noise words (N), and (as phrase contents) made up of sequences of one or more significant words (G) and ordinary words (O); sequences of ordinary words delimited by stop marks and/or noise words that do not include at least one significant word (in the contents) are ignored. The experimental results presented in [6] show that DelSNcontGO performs well with respect to the accuracy of classification. In this paper, this statistical "bag of phrases" DR approach will be further concerned in the section of experimental results.…”

Section: Significant Words (G) (mentioning)

confidence: 99%
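
The DelSNcontGO description quoted above can be illustrated with a minimal sketch. It is not the cited authors' implementation: the tokenizer, the set of stop marks, and the function name `extract_phrases` are assumptions made for illustration, and the noise-word and significant-word sets are simply passed in rather than derived from support thresholds.

```python
import re

# ASSUMPTION: this set of stop marks is illustrative; the source does not list them.
STOP_MARKS = set(".,;:!?")


def extract_phrases(text, noise_words, significant_words):
    """Split text into phrases delimited by stop marks or noise words.

    A phrase is kept only if it contains at least one significant word;
    sequences made up purely of ordinary words are ignored.
    """
    # ASSUMPTION: a simple lowercase, letters-only tokenizer that also
    # emits stop marks as separate tokens.
    tokens = re.findall(r"[a-z]+|[.,;:!?]", text.lower())
    phrases = []
    current = []

    def flush():
        # Keep the pending word sequence only if it has a significant word.
        if any(word in significant_words for word in current):
            phrases.append(tuple(current))
        current.clear()

    for token in tokens:
        if token in STOP_MARKS or token in noise_words:
            flush()  # a delimiter closes the current phrase
        else:
            current.append(token)
    flush()  # close a trailing phrase that has no final delimiter
    return phrases
```

With noise words `{"the", "a", "of"}` and significant words `{"bank", "loan"}`, the text "The bank approved a small loan. Rates of interest rose." yields two phrases, `("bank", "approved")` and `("small", "loan")`; the remaining word sequences contain no significant word and are dropped.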

“…; (3) W_GLO ← read [the document-base] to create a global word set, where the word document-base support supp_GLO is associated with each word u_h in W_GLO; (4) for each C_i ∈ C do (5) W_LOC ← read documents that reference C_i to create a local word set, where the local support supp_LOC is associated with each word u_h in W_LOC; (6) for each word u_h ∈ W_LOC do (7) contribution ← (u_h.supp_LOC / u_h.supp_GLO …”

Section: Proposed Statistical Feature Selection (mentioning)

confidence: 99%
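
The pseudocode fragment quoted above, combined with the earlier statement that the first k words per class (ordered by descending contribution) are taken as significant, can be sketched as follows. The quoted formula is cut off after the ratio of local to global support, so this sketch assumes the contribution is that plain ratio; the full definition may include further terms. The function name `significant_words`, the input layout, and the alphabetical tie-breaking are also assumptions.

```python
from collections import Counter, defaultdict


def significant_words(docs, k):
    """Return the top-k words per class, ranked by contribution.

    docs: iterable of (class_label, list_of_words) pairs.
    Support is counted as the number of documents containing a word.
    """
    supp_glo = Counter()              # support over the whole document-base
    supp_loc = defaultdict(Counter)   # support within each class
    for label, words in docs:
        unique = set(words)           # count each word once per document
        supp_glo.update(unique)
        supp_loc[label].update(unique)

    result = {}
    for label, local in supp_loc.items():
        # ASSUMPTION: contribution = local support / global support.
        # The quoted formula is truncated, so this may be incomplete.
        scored = [(local[word] / supp_glo[word], word) for word in local]
        # Descending contribution; ties broken alphabetically (an assumption).
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[label] = [word for _, word in scored[:k]]
    return result
```

Under this assumed ratio, a word that occurs only in documents of one class scores 1.0 for that class, while a word spread evenly across all classes scores lower and is less likely to be selected.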

“…[6,29]). In [3] Antonie and Zaïane argue: an associative text classifier "is fast during both training and categorization phases", especially when handling large document-bases; and such classifiers "can be read, understood and modified by humans".…”

Section: Introduction (mentioning)

confidence: 99%

A Hybrid Statistical Data Pre-processing Approach for Language-Independent Text Classification

Wang, Coenen, Sanderson

2009

Advanced Data Mining and Applications

Self Cite

Abstract. Data pre-processing is an important topic in Text Classification (TC). It aims to convert the original textual data into a data-mining-ready structure, where the most significant text-features that serve to differentiate between text-categories are identified. Broadly speaking, textual data pre-processing techniques can be divided into three groups: (i) linguistic, (ii) statistical, and (iii) hybrid (i) & (ii). With regard to language-independent TC, our study relates to the statistical aspect only. The nature of textual data pre-processing includes: Document-base Representation (DR) and Feature Selection (FS). In this paper, we propose a hybrid statistical FS approach that integrates two existing (statistical FS) techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and GSSC (Galavotti⋅Sebastiani⋅Simi Coefficient). Our proposed approach is presented under a statistical "bag of phrases" DR setting. The experimental results, based on the well-established associative text classification approach, demonstrate that our proposed technique outperforms existing mechanisms with respect to the accuracy of classification.

Harnessing Background Knowledge for E-Learning Recommendation

Mbipom, Craw, Massie

2016

Research and Development in Intelligent Systems XXXIII

Abstract. The growing availability of good quality, learning-focused content on the Web makes it an excellent source of resources for e-learning systems. However, learners can find it hard to retrieve material well-aligned with their learning goals because of the difficulty in assembling effective keyword searches due to both an inherent lack of domain knowledge, and the unfamiliar vocabulary often employed by domain experts. We take a step towards bridging this semantic gap by introducing a novel method that automatically creates custom background knowledge in the form of a set of rich concepts related to the selected learning domain. Further, we develop a hybrid approach that allows the background knowledge to influence retrieval in the recommendation of new learning materials by leveraging the vocabulary associated with our discovered concepts in the representation process. We evaluate the effectiveness of our approach on a dataset of Machine Learning and Data Mining papers and show it to outperform the benchmark methods.

Document-Base Extraction for Single-Label Text Classification

Wang, Sanderson, Coenen, et al.

Data Warehousing and Knowledge Discovery

Self Cite

Abstract. Many text mining applications, especially when investigating Text Classification (TC), require experiments to be performed using common text-collections, such that results can be compared with alternative approaches. With regard to single-label TC, most text-collections (textual data-sources) in their original form have at least one of the following limitations: the overall volume of textual data is too large for ease of experimentation; there are many predefined classes; most of the classes consist of only a very few documents; some documents are labeled with a single class whereas others have multiple classes; and there are documents found with little or no actual text-content. In this paper, we propose a standard approach to automatically extract "qualified" document-bases from a given textual data-source that can be used more effectively and reliably in single-label TC experiments. The experimental results demonstrate that document-bases extracted based on our approach can be used effectively in single-label TC experiments.
