QTID: Quran Text Image Dataset

Badry, Mahmoud; Hassan, Hesham; Bayomi, Hanaa; Oakasha, Hussien

doi:10.14569/ijacsa.2018.090351

Cited by 4 publications

(3 citation statements)

References 9 publications


“…
… | Type of Content | Availability | Size of Dataset
ACTIV2 [12] | Embedded words | Public | 10,415 text images
QTID [13] | Synthetic words | Private | 309,720 words and 249,428 characters
IFN/ENIT [14] | Handwritten words | Public | 115,000 words and 212,000 characters
AHDB [15] | Handwritten words and digits | Private | 30,000 words
APTI [16] | Printed words | Public | 113,284 words and 648,280 characters
HACDB [17] | Handwritten characters | Public | 6600 characters and 50 writers
UPTI [18] | Printed text lines | Public | 10,000 text lines
Digital Jawi [19] | Jawi paleography images | Public | 168 words and 1524 characters
KHATT [20] | Handwritten text lines | Public | 9327 lines, 165,890 words and 589,924 characters
ALIF [21] | Embedded text lines | Upon request | 1804 words and 89,819 characters
ACTIV [22] | Embedded text lines | Public | 4824 lines and 21,520 words
SmartATID [23] | Printed and handwritten pages | Public | 9088 pages
Degraded historical [24] | Handwritten documents | Public | 10 handwritten images and 10 printed images
Printed PAW [25] | Printed subwords | Upon request | 415,280 unique words and 550,000 sub words
Checks [26] | Handwritten subwords and digits | Private | 29,498 subwords and 15,148 digits
Numeral [27] | Handwritten digits | Public | 21,120 digits and 44 writers
Forms [28] | Handwritten characters | Private | 15,800 characters and 500 writers
KAFD [29] | Printed pages and lines | Public | 28,767 pages and 644,006 lines
AHDBIFTR [30] | Handwritten images | Public | 497 word images and 5 writers
ARABASE [31] | Handwritten text | Public | 47,000 words and 500 free Arabic sentences
CEDAR [32] | Handwritten pages | Private | 20,000 words, 10 writers, and 100 documents
CENPARMI [26] | Handwritten subwords and digits | Public | 6000 digit images

Shafi and Zia [33] surveyed automatic Urdu text recognition techniques and described the algorithms, techniques, datasets, challenges, and future directions for Urdu OCR. Additionally, [34] reviewed the availability of datasets and suggested more training data to address the unique challenges of OCR systems.…”

Section: Dataset (mentioning)

confidence: 99%

A Survey of OCR in Arabic Language: Applications, Techniques, and Challenges

et al. 2023


Optical character recognition (OCR) is the process of extracting handwritten or printed text from a scanned or printed image and converting it to a machine-readable form for further data processing, such as searching or editing. Automatic text extraction using OCR helps to digitize documents for improved productivity and accessibility and for preservation of historical documents. This paper provides a survey of the current state-of-the-art applications, techniques, and challenges in Arabic OCR. We present the existing methods for each step of the complete OCR process to identify the best-performing approach for improved results. This paper follows the keyword-search method for reviewing the articles related to Arabic OCR, including the backward and forward citations of each article. In addition to state-of-the-art techniques, this paper identifies research gaps and presents future directions for Arabic OCR.


“…The Quran Text Image Dataset (QTID) (Badry et al 2018) is the first Arabic dataset that includes Arabic marks (diacritics). It consists of 309,720-word images with a dimension of 192×64.…”

Section: Arabic Optical Character Recognition Dataset (mentioning)

confidence: 99%

A Review of Arabic Text Recognition Dataset

Al-Sheikh, Mohd, Warlina 2020

APJITM


Building a robust Optical Character Recognition (OCR) system for languages with cursive scripts, such as Arabic, has always been challenging. These challenges increase if the text contains diacritics of different sizes for characters and words. Apart from the complexity of the font used, these challenges must be addressed in recognizing the text of the Holy Quran. To solve these challenges, the OCR system would have to undergo different phases. Each problem would have to be addressed using different approaches; thus, researchers are studying these challenges and proposing various solutions. This has motivated this study to review Arabic OCR datasets, because the dataset plays a major role in determining the nature of an OCR system. State-of-the-art approaches in segmentation and recognition are identified, implemented with Recurrent Neural Networks (Long Short-Term Memory, LSTM, and Gated Recurrent Unit, GRU) together with Connectionist Temporal Classification (CTC). This also includes deep learning models and the implementation of GRU in the Arabic domain. This paper contributes a profile of Arabic text recognition datasets, thereby characterizing the nature of the OCR systems developed, and identifies research directions for building Arabic text recognition datasets.


“…Other datasets [8] [15] where such a system can figure out the characters and word on the lines. This paper introduces a public Quranic dataset on page and line level where the most unique about this dataset it focus on a diacritics text where only one dataset [11] introduce Quranic and diacritic dataset, unfortunately, this dataset is not public and it's synthetically generated on word level. In addition, what make this dataset more unique is, it based on Mushaf al Madinah benchmark which was written by the hand of Arabic calligraphy artist using the Uthmanic script.…”

Section: Related Work (mentioning)

confidence: 99%

A Quranic Dataset for Text Recognition

Al-Sheikh, Mohd 2019

Proceedings of the 1st International Conference on Informatics, Engineering, Science and Technology, INCITES


Any text recognition or Optical Character Recognition (OCR) system requires a dataset to learn how to recognize text. Due to the lack of a standard benchmark, most studies in this field were conducted using private datasets, without a fair comparison. In this work, we used the standard Mushaf al Madinah benchmark, which follows certain rules of writing style; for example, each page should start with the beginning of a verse and end with the end of a verse. Following these rules makes the words vary in size and the paragraphs vary across pages. These characteristics make the recognition of Quranic text more challenging than that of normal Arabic text, and state-of-the-art systems fail to recognize Quranic text. Therefore, a Quranic OCR dataset is presented in this study. It contains 604 images at the page level and 8927 images at the text-line level. The dataset is public and free to use for the research community, and it should help researchers in the field of Arabic OCR.
