Accurate Stemming of Dutch for Text Classification

Gaustad, Tanja; Bouma, Gosse

doi:10.1163/9789004334038_010

Cited by 13 publications

(5 citation statements)

References 9 publications


“…If the word found to be illogical then it substitutes the suffix with the other words [4]. In the Dutch stemmer it uses a suffix stripping algorithm and dictionary lookup rule based methods [5]. In the Nepali Stemming it uses a morphological analyzer which determines the given inflected word. In this it also tells about the Dawson stemming algorithm, krowertz algorithm [6]. Lightweight Stemmer for Bengali also exists. In which it just strips the affix from the word without doing the complete morphological analysis.…”

Section: Related Work (mentioning)

confidence: 99%

Design and Development of a Stemmer for Punjabi

Kumar, Rana

2010

IJCA


Stemming is the process of removing affixes from inflected words without doing a complete morphological analysis. A stemming algorithm is a procedure that reduces all words with the same stem to a common form [20]. It is useful in many areas of computational linguistics and information retrieval. Various search engines use this technique to find the best solution for a problem. The algorithm is the basic building block of a stemmer, and a stemmer is used in information retrieval systems to improve performance. This paper presents a stemmer for Punjabi that uses a brute-force algorithm together with a suffix-stripping technique. Similar techniques can be used to build stemmers for other languages such as Hindi, Bengali and Marathi. The results of the stemmer are good, and it can be effective in information retrieval systems. The stemmer also reduces the problems of over-stemming and under-stemming.
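The combination described in this abstract, a brute-force lookup backed by suffix stripping, can be illustrated with a minimal sketch. It is not the Punjabi stemmer from the paper: the lookup table, the suffix list, the minimum stem length and the function name `stem` are all hypothetical, and English toy data stands in for the real lexicon.

```python
# Minimal sketch of a brute-force + suffix-stripping stemmer.
# All data below is hypothetical English toy data, not taken from the paper.

# Brute-force component: a table mapping inflected forms directly to stems.
LOOKUP = {"ran": "run", "mice": "mouse"}

# Suffix-stripping component: candidate suffixes to remove.
SUFFIXES = ["ing", "es", "s"]

# Assumed guard against over-stemming: never leave a stem shorter than this.
MIN_STEM_LEN = 3


def stem(word: str) -> str:
    """Return a stem for `word` using table lookup, then suffix stripping."""
    word = word.lower()
    # 1. Brute force: an exact match in the table wins.
    if word in LOOKUP:
        return LOOKUP[word]
    # 2. Suffix stripping: try the longest suffix first.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    # 3. No rule applies: return the word unchanged.
    return word
```

The minimum stem length is one simple way to limit over-stemming: a short word such as "bus" is left alone rather than cut to "bu". Under-stemming is handled only as far as the table and suffix list reach.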


“…The rule-based approach is a traditional method for stemming/lemmatisation (i.e. affix stripping) (Porter 1980;Gaustad and Bouma, 2002) and entails the use of language-specific rules to identify the base-forms (i.e. lemmas) of word forms.…”

Section: Lemmatisation (mentioning)

confidence: 99%

Using technology transfer to advance automatic lemmatisation for Setswana

Groenewald

2009

Proceedings of the First Workshop on Language Technologies for African Languages - AfLaT '09


South African languages (and indigenous African languages in general) lag behind other languages in terms of the availability of linguistic resources. Efforts to improve or fast-track the development of linguistic resources are required to bridge this ever-increasing gap. In this paper we emphasize the advantages of technology transfer between two languages to advance an existing linguistic technology/resource. The advantages of technology transfer are illustrated by showing how an existing lemmatiser for Setswana can be improved by applying a methodology that was first used in the development of a lemmatiser for Afrikaans.


“…For example, Gaustad and Bouma (2002) report results from experiments on Dutch email and news text classification using simple suffix stripping and a dictionary-based stemming. Neither method improved classification accuracy in their experiments.…”

Section: Stemming (mentioning)

confidence: 99%

Classifying Amharic webnews

et al. 2009


We present work aimed at compiling an Amharic corpus from the Web and automatically categorizing the texts. Amharic is the second most spoken Semitic language in the world (after Arabic) and is used for countrywide communication in Ethiopia. It is highly inflectional and quite dialectally diversified. We discuss the issues of compiling and annotating a corpus of Amharic news articles from the Web. This corpus was then used in three sets of text classification experiments. Working with a less-researched language highlights a number of practical issues that might otherwise receive less attention or go unnoticed. The purpose of the experiments has not primarily been to develop a cutting-edge text classification system for Amharic, but rather to put the spotlight on some of these issues. The first two sets of experiments investigated the use of Self-Organizing Maps (SOMs) for document classification. Testing on small datasets, we first looked at classifying unseen data into 10 predefined categories of news items, and then at clustering it around query content, when taking 16 queries as class labels. The second set of experiments investigated the effect of operations such as stemming and part-of-speech tagging on text classification performance. We compared three representations while constructing classification models based on bagging of decision trees for the 10 predefined news categories. The best accuracy was achieved using the full text as representation. A representation using only the nouns performed almost equally well, confirming the assumption that most of the information required for distinguishing between various categories actually is contained in the nouns, while stemming did not have much effect on the performance of the classifier.
