2008
DOI: 10.1007/s10791-008-9083-7

Using the Web as corpus for self-training text categorization

Abstract: Most current methods for automatic text categorization are based on supervised learning techniques and, therefore, they face the problem of requiring a great number of training instances to construct an accurate classifier. In order to tackle this problem, this paper proposes a new semi-supervised method for text categorization, which considers the automatic extraction of unlabeled examples from the Web and the application of an enriched self-training approach for the construction of the classifier. This metho…

Cited by 15 publications (9 citation statements)
References 18 publications
“…In line with these current works, we have proposed a new semi-supervised method for general text classification tasks [5]. This method differs from previous approaches in two main issues.…”
Section: Introduction (citation type: mentioning)
confidence: 88%
“…Therefore, Automatic text categorization plays an important role in helping information users overcome such a challenge by reducing the time needed to classify thousands of daily arrived documents, without the need for experts. Thus, Automatic TC can significantly reduce the cost and effort of manual categorization [3]. For example, it has been reported in the Internet World Stats (http://www.internetworldstats.com/stats7.htm) that the number of Arabic speaking Internet users has grown 2,501.2 % in the last eleven years (2000-2011), which is the highest growth rate among other languages.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…The first approach allow building a classifier by considering a small set of tagged documents along with a great number of unlabeled texts (Nigam et al, 2000; Krithara, et al, 2008; Guzmán-Cabrera et al, 2009). The second focuses on the construction of classifiers by reusing training sets from related domains (Aue and Gamon, 2005; Dai et al, 2007).…”
Section: Introduction (citation type: mentioning)
confidence: 99%