2018
DOI: 10.14569/ijacsa.2018.090351

QTID: Quran Text Image Dataset

Abstract: Improving the accuracy of Arabic text recognition in imagery requires a big modern dataset as data is the fuel for many modern machine learning models. This paper proposes a new dataset, called QTID, for Quran Text Image Dataset, the first Arabic dataset that includes Arabic marks. It consists of 309,720 different 192x64 annotated Arabic word images that contain 2,494,428 characters in total, which were taken from the Holy Quran. These finely annotated images were randomly divided into 90%, 5%, 5% sets for tra…
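The random 90%, 5%, 5% division described in the abstract can be illustrated with a minimal Python sketch. Only the image count and the three fractions come from the abstract. The function name `split_indices`, the fixed seed, and the generic names `first`, `second`, `third` are hypothetical, and this is not the authors' actual procedure.

```python
import random

TOTAL_IMAGES = 309_720  # word-image count stated in the abstract


def split_indices(n, fractions=(0.90, 0.05, 0.05), seed=0):
    """Shuffle range(n) and cut it into consecutive parts sized by `fractions`.

    Hypothetical helper: the seed and the rounding rule are assumptions,
    not details taken from the paper.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes the remainder so rounding never drops an index.
        end = n if i == len(fractions) - 1 else start + round(n * frac)
        parts.append(indices[start:end])
        start = end
    return parts


first, second, third = split_indices(TOTAL_IMAGES)
print(len(first), len(second), len(third))
```

Shuffling indices rather than the images themselves keeps the sketch independent of how the images are stored; the indices can then be mapped to file names or records.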

citations
Cited by 4 publications
(3 citation statements)
references
References 9 publications
“…
… | Type of Content | Availability | Size of Dataset
ACTIV2 [12] | Embedded words | Public | 10,415 text images
QTID [13] | Synthetic words | Private | 309,720 words and 249,428 characters
IFN/ENIT [14] | Handwritten words | Public | 115,000 words and 212,000 characters
AHDB [15] | Handwritten words and digits | Private | 30,000 words
APTI [16] | Printed words | Public | 113,284 words and 648,280 characters
HACDB [17] | Handwritten characters | Public | 6600 characters and 50 writers
UPTI [18] | Printed text lines | Public | 10,000 text lines
Digital Jawi [19] | Jawi paleography images | Public | 168 words and 1524 characters
KHATT [20] | Handwritten text lines | Public | 9327 lines, 165,890 words and 589,924 characters
ALIF [21] | Embedded text lines | Upon request | 1804 words and 89,819 characters
ACTIV [22] | Embedded text lines | Public | 4824 lines and 21,520 words
SmartATID [23] | Printed and handwritten pages | Public | 9088 pages
Degraded historical [24] | Handwritten documents | Public | 10 handwritten images and 10 printed images
Printed PAW [25] | Printed subwords | Upon request | 415,280 unique words and 550,000 sub words
Checks [26] | Handwritten subwords and digits | Private | 29,498 subwords and 15,148 digits
Numeral [27] | Handwritten digits | Public | 21,120 digits and 44 writers
Forms [28] | Handwritten characters | Private | 15,800 characters and 500 writers
KAFD [29] | Printed pages and lines | Public | 28,767 pages and 644,006 lines
AHDBIFTR [30] | Handwritten images | Public | 497 word images and 5 writers
ARABASE [31] | Handwritten text | Public | 47,000 words and 500 free Arabic sentences
CEDAR [32] | Handwritten pages | Private | 20,000 words, 10 writers, and 100 documents
CENPARMI [26] | Handwritten subwords and digits | Public | 6000 digit images

Shafi and Zia [33] surveyed automatic Urdu text recognition techniques and described the algorithms, techniques, datasets, challenges, and future directions for Urdu OCR. Additionally, [34] reviewed the availability of datasets and suggested more training data to address the unique challenges of OCR systems.…”
Section: Dataset (mentioning)
confidence: 99%
“…The Quran Text Image Dataset (QTID) (Badry et al 2018) is the first Arabic dataset that includes Arabic marks (diacritics). It consists of 309,720-word images with a dimension of 192×64.…”
Section: Arabic Optical Character Recognition Dataset (mentioning)
confidence: 99%
“…Other datasets [8] [15] where such a system can figure out the characters and word on the lines. This paper introduces a public Quranic dataset on page and line level where the most unique about this dataset it focus on a diacritics text where only one dataset [11] introduce Quranic and diacritic dataset, unfortunately, this dataset is not public and it's synthetically generated on word level. In addition, what make this dataset more unique is, it based on Mushaf al Madinah benchmark which was written by the hand of Arabic calligraphy artist using the Uthmanic script.…”
Section: Related Work (mentioning)
confidence: 99%