Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR 1995
DOI: 10.1145/215206.215349

Little words can make a big difference for text classification

Abstract: Most information retrieval systems use stopword lists and stemming algorithms. However, we have found that recognizing singular and plural nouns, verb forms, negation, and prepositions can produce dramatically different text classification results. We present results from text classification experiments that compare relevancy signatures, which use local linguistic context, with corresponding indexing terms that do not. In two different domains, relevancy signatures produced better results than the simple index…
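
The abstract's central claim can be illustrated with a minimal sketch. It assumes a tiny hand-made stopword list and a toy suffix stripper (neither is from the paper, and the stripper is not the Porter stemmer), and the example phrases are invented for illustration. It shows only that stopword removal plus stemming maps phrases that differ in verb form, preposition, or negation to the same index terms; it does not implement relevancy signatures.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "by", "to",
             "was", "were", "not", "no"}


def naive_stem(token):
    """Toy suffix stripper (an assumption; NOT the Porter stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Conventional indexing: lowercase, drop stopwords, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def context_terms(text):
    """Keep every word form, including prepositions and negation."""
    return re.findall(r"[a-z]+", text.lower())


# Invented phrase pairs that differ only in "little words" or word forms.
pairs = [
    ("bombed by the rebels", "bombing of the rebels"),
    ("was bombed", "was not bombed"),
]

for left, right in pairs:
    same_index = index_terms(left) == index_terms(right)
    same_context = context_terms(left) == context_terms(right)
    print(f"{left!r} vs {right!r}: "
          f"index terms equal={same_index}, full context equal={same_context}")
```

In this sketch both pairs collapse to identical index terms while remaining distinct when the full local context is kept, which is the kind of distinction the abstract says can change classification results.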

Cited by 69 publications (26 citation statements)
References 8 publications
“…Using the same pre-processing but in addition lemmatizing the noun and verb forms in the documents, the results are as follows: algorithm representation language Accuracy Multi 0:3 In distinction to the situation in query-based Retrieval, in Text Categorization the lemmatization of terms does not seem to improve the Accuracy: although lemmatization enhances the Recall of terms, it may well hurt Precision more (see also [15]). In Text Categorization the positive effect of the conflation of morphological variants of a word is small: If two forms of a word are both important terms for a class, then they will both obtain an appropriate positive weight for that class provided they occur often enough, and if they don't occur often enough, their contribution is not important anyway.…”
Section: Lemmatized Abstract (mentioning)
confidence: 99%
“…Within this framework, research shows that verb forms and prepositions play key roles in indicating classification term meanings [7]. However, because image captions are so concise, each word has an extremely high information content; therefore, using keyword-based approaches that ignore both syntactic and semantic information in captions will simply fail to differentiate photographs and will often yield incorrect indexing. Using syntactic relations expressed in captions is definitely more efficient but still often yields incorrect indexing.…”
Section: Existing Approaches Fall Short (mentioning)
confidence: 99%
“…These extraction patterns are domain-dependent linguistic expressions consisting of a trigger word, conditions to be met and case roles. These patterns are considered to be dependent on the syntactic context of tokens; verb forms and prepositions are considered important indicators of the meaning of classification terms [9]. In particular, extraction patterns that contain prepositions are reported to have much higher correlation with relevant texts than their corresponding trigger words.…”
Section: Image Caption Retrieval Approaches: a Brief Review (mentioning)
confidence: 99%