SANAD: Single-label Arabic News Articles Dataset for automatic text categorization

Einea, Omar; Elnagar, Ashraf; Debsi, Ridhwan Al

doi:10.1016/j.dib.2019.104076

Cited by 67 publications

(29 citation statements)

References 3 publications


“…There is a considerable gap when it comes to audio datasets when compared with the computer vision domain where large datasets such as MNIST 1 and ImageNet 2 became a baseline for researchers to evaluate their work, or text-based datasets on the rise including those dedicated for the Arabic language such as [1].…”

Section: Experimental Design Materials and Methods (mentioning)

confidence: 99%

Ar-DAD: Arabic diversified audio dataset

Lataifeh

Elnagar

2020

Data in Brief

Self Cite


The automatic identification and verification of speakers through representative audio continue to gain the attention of many researchers with diverse domains of applications. Despite this diversity, the availability of classified and categorized multi-purpose Arabic audio libraries is scarce. Therefore, we introduce a large Arabic-based audio clips dataset (15810 clips) of 30 popular reciters cantillating 37 chapters from the Holy Quran. These chapters have a variable number of verses saved to different subsequent folders, where each verse is allocated one folder containing 30 audio clips for the declared reciters covering the same textual content. An additional 397 audio clips for 12 competent imitators of the top reciters are collected based on popularity and number of views/downloads to allow for cross-comparison of text, reciters, and authenticity. Based on the volume, quality, and rich diversity of this dataset we anticipate a wide range of deployments for speaker identification, in addition to setting a new direction for the structure and organization of similar large audio clips dataset.


“…This is a recent dataset comprises of a large number of news documents with a total of around 195 thousand. Basic information is shown in table 3, and more information can be found in [45].…”

Section: Large-size Datasets With Original Text (mentioning)

confidence: 99%

“…One of the advantages of the methodology of this study is requiring no preprocessing for the input text documents; any TABLE 3. Details of the large-size datasets for SANAD [45].…”

Section: Data Preprocessing (mentioning)

confidence: 99%

A Superior Arabic Text Categorization Deep Model (SATCDM)

Alhawarat

Aseeri

2020

IEEE Access


Categorizing Arabic text documents is considered an important research topic in the field of Natural Language Processing (NLP) and Machine Learning (ML). The number of Arabic documents is tremendously increasing daily as new web pages, news articles, social media contents are added. Hence, classifying such documents in specific classes is of high importance to many people and applications. Convolutional Neural Network (CNN) is a class of deep learning that has been shown to be useful for many NLP tasks, including text translation and text categorization for the English language. Word embedding is a text representation currently used to represent text terms as real-valued vectors in vector space that represent both syntactic and semantic traits of text. Current research studies in classifying Arabic text documents use traditional text representation such as bag-of-words and TF-IDF weighting, but few use word embedding. Traditional ML algorithms have already been used in Arabic text categorization, and good results are achieved. In this study, we present a Multi-Kernel CNN model for classifying Arabic news documents enriched with n-gram word embedding, which we call A Superior Arabic Text Categorization Deep Model (SATCDM). The proposed solution achieves very high accuracy compared to current research in Arabic text categorization using 15 freely available datasets. The model achieves an accuracy ranging from 97.58% to 99.90%, which is superior to similar studies on the Arabic document classification task.


“…Text classification can be divided into single-label text classification and multilabel text classification according to the number of labels to which the text belongs. The single-label text refers to each text belonging to only one category, while multilabel text refers to each text belonging to one or more categories [51–53]. The calculation formula for text classification can be defined as follows: …”

Section: Problem Statement (mentioning)

confidence: 99%

Application of BERT to Enable Gene Classification Based on Clinical Evidence

Xiang

et al. 2020

BioMed Research International


The identification of profiled cancer-related genes plays an essential role in cancer diagnosis and treatment. Based on literature research, the classification of genetic mutations continues to be done manually nowadays. Manual classification of genetic mutations is pathologist-dependent, subjective, and time-consuming. To improve the accuracy of clinical interpretation, scientists have proposed computational-based approaches for automatic analysis of mutations with the advent of next-generation sequencing technologies. Nevertheless, some challenges, such as multiple classifications, the complexity of texts, redundant descriptions, and inconsistent interpretation, have limited the development of algorithms. To overcome these difficulties, we have adapted a deep learning method named Bidirectional Encoder Representations from Transformers (BERT) to classify genetic mutations based on text evidence from an annotated database. During the training, three challenging features such as the extreme length of texts, biased data presentation, and high repeatability were addressed. Finally, the BERT+abstract demonstrates satisfactory results with 0.80 logarithmic loss, 0.6837 recall, and 0.705 F-measure. It is feasible for BERT to classify the genomic mutation text within literature-based datasets. Consequently, BERT is a practical tool for facilitating and significantly speeding up cancer research towards tumor progression, diagnosis, and the design of more precise and effective treatments.
