2019
DOI: 10.1016/j.dib.2019.104076

SANAD: Single-label Arabic News Articles Dataset for automatic text categorization

Abstract: Text classification is one of the most popular Natural Language Processing (NLP) tasks. Text classification (also known as categorization) has been an active research topic in recent years. However, much less attention has been directed towards this task in Arabic, due to the lack of rich, representative resources for training an Arabic text classifier. Therefore, we introduce a large Single-labeled Arabic News Articles Dataset (SANAD) of textual data collected from three news portals. The dataset is a large one consisting of almos…

Cited by 67 publications (29 citation statements)
References 3 publications
“…There is a considerable gap when it comes to audio datasets when compared with the computer vision domain where large datasets such as MNIST 1 and ImageNet 2 became a baseline for researchers to evaluate their work, or text-based datasets on the rise including those dedicated for the Arabic language such as [1].…”
Section: Experimental Design Materials and Methods (mentioning)
confidence: 99%
“…This is a recent dataset comprising a large number of news documents, with a total of around 195 thousand. Basic information is shown in Table 3, and more information can be found in [45].…”
Section: Large-size Datasets With Original Text (mentioning)
confidence: 99%
“…One of the advantages of the methodology of this study is requiring no preprocessing for the input text documents; any TABLE 3. Details of the large-size datasets for SANAD [45].…”
Section: Data Preprocessing (mentioning)
confidence: 99%
“…Text classification can be divided into single-label text classification and multilabel text classification according to the number of labels to which the text belongs. The single-label text refers to each text belonging to only one category, while multilabel text refers to each text belonging to one or more categories [51–53]. The calculation formula for text classification can be defined as follows: …”
Section: Problem Statement (mentioning)
confidence: 99%
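
The single-label versus multilabel distinction drawn in the last statement can be illustrated with a minimal Python sketch. It does not reproduce the formula elided in that quotation. The category names, document IDs, and helper functions (`is_single_label`, `to_indicator`) are hypothetical, chosen only for illustration, and are not taken from SANAD or the citing papers.

```python
from typing import Dict, List, Set

# Hypothetical category names, for illustration only.
CATEGORIES: List[str] = ["sports", "politics", "finance"]


def is_single_label(labels: Dict[str, Set[str]]) -> bool:
    """Return True if every document carries exactly one category."""
    return all(len(cats) == 1 for cats in labels.values())


def to_indicator(cats: Set[str]) -> List[int]:
    """Encode a document's categories as a 0/1 vector over CATEGORIES."""
    return [1 if c in cats else 0 for c in CATEGORIES]


# Single-label corpus: each document belongs to exactly one category.
single = {"doc1": {"sports"}, "doc2": {"finance"}}

# Multilabel corpus: a document may belong to one or more categories.
multi = {"doc1": {"sports", "politics"}, "doc2": {"finance"}}

if __name__ == "__main__":
    for name, corpus in (("single", single), ("multi", multi)):
        print(name, "is single-label:", is_single_label(corpus))
        for doc_id, cats in sorted(corpus.items()):
            print(" ", doc_id, to_indicator(cats))
```

In a single-label corpus every indicator vector sums to exactly one, whereas a multilabel corpus allows sums of one or more.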