2009
DOI: 10.1007/978-3-642-03226-4_2

Text and Hypertext Categorization

Abstract: Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These have features that add further complexity to the categorization task but also offer the possibility of using information that is not available in standard text classification, su…

Citations: cited by 7 publications (2 citation statements)
References: 68 publications (79 reference statements)
“…After the tokenizing stage, a filtering stage is carried out, which removes very common words [9]. Examples of words that count as stopwords are yang, dan, di, itu, dengan, untuk, tidak, dari, dalam, akan, pada, ini, juga, saya, serta, adalah, bahwa, lain, kamu, and so on.…”
Section: Figure 2 Preprocessing Stage (unclassified)
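The tokenizing and filtering steps described in the citation statement above can be sketched as follows. This is a minimal illustration, not the citing paper's implementation: the function names `tokenize` and `remove_stopwords`, the regular expression, and the sample sentence are assumptions, and the stopword set contains only the words quoted in the statement rather than a complete Indonesian stopword list.

```python
import re

# Stopwords quoted in the citation statement (an illustrative subset,
# not a complete Indonesian stopword list).
STOPWORDS = {
    "yang", "dan", "di", "itu", "dengan", "untuk", "tidak", "dari", "dalam",
    "akan", "pada", "ini", "juga", "saya", "serta", "adalah", "bahwa",
    "lain", "kamu",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Filtering step: drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


# Hypothetical sample sentence, chosen only for illustration.
tokens = tokenize("Saya membaca buku itu di perpustakaan dan menulis ringkasan.")
filtered = remove_stopwords(tokens)
print(filtered)
```

Using a set for the stopwords keeps each membership check at constant time, which matters once the list grows beyond a handful of words.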
“…There are five levels of representing the natural language document by means of a set of index. These are character, word, phrase, sentence or language/application specific levels (Benbrahim and Bramer, 2009). The basic and most widely-used approach for indexing is the use of word (token) level, in a process known as tokenization.…”
Section: Data Acquisition (mentioning)
confidence: 99%
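
To make the difference between indexing levels concrete, the sketch below derives index terms from the same string at three of the levels named in the statement above: character, word, and phrase. It is an illustration under stated assumptions, not the method of Benbrahim and Bramer (2009): the helper names are hypothetical, character-level terms are taken to be character trigrams, and phrase-level terms are approximated by adjacent word pairs.

```python
import re


def char_ngrams(text, n=3):
    """Character level: overlapping character n-grams (n=3 is an assumption)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def word_tokens(text):
    """Word level: tokenization into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def word_bigrams(text):
    """Phrase level, approximated here by pairs of adjacent words."""
    words = word_tokens(text)
    return list(zip(words, words[1:]))


# Hypothetical sample input.
doc = "Web page classification"
print(char_ngrams("web page"))
print(word_tokens(doc))
print(word_bigrams(doc))
```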