2013
DOI: 10.1117/12.2003731

A segmentation-free approach to Arabic and Urdu OCR

Abstract: In this paper, we present a generic Optical Character Recognition system for Arabic script languages called Nabocr. Nabocr uses OCR approaches specific for Arabic script recognition. Performing recognition on Arabic script text is relatively more difficult than Latin text due to the nature of Arabic script, which is cursive and context sensitive. Moreover, Arabic script has different writing styles that vary in complexity. Nabocr is initially trained to recognize both Urdu Nastaleeq and Arabic Naskh fonts. How…

Citations: cited by 89 publications (76 citation statements)
References: 9 publications
“…Multilayer perceptron, CNN, Recurrent Neural Networks (RNN), LSTM and its variations are extensively used for OCR [9]- [13]. Raw images of UPTI [1] data set are used to train Multi-dimensional LSTM [13]. For the same data set, CNN is also trained and 98.1% accuracy is achieved.…”
Section: Related Work (mentioning)
confidence: 99%
“…Segmentation-free techniques tend to be less complex than segmentation-based approaches in the sense that they do not require segmentation of text into individual characters. These methods are relatively easier to implement but the major challenge with these approaches is the larger number of classes to be recognized [21], [22], [4], [23], [24], [25]. This number is the same as the number of unique words (ligatures) in the vocabulary under study.…”
Section: Introduction (mentioning)
confidence: 99%
“…A set of rules is defined to associate the diacritics with the main body and recognize the complete ligature. Another segmentation-free recognition system for Urdu ligatures is presented in [24] where a set of shape descriptors is used to characterize the ligatures. A total of 10,000 Urdu ligatures in Nastaliq font and 20,000 Arabic ligatures in the Naskh font are used as training data.…”
Section: Introduction (mentioning)
confidence: 99%
“…The recognition of primary ligatures and dots/diacritics is carried out separately which are associated later to form the complete/true ligature. As opposed to conventional methods that either carry out recognition of detached characters [6][7][8][9] or work with single font size [5,10], the developed methodology recognizes ligatures irrespective of the font size. The training data prepared through the sequential clustering algorithm is expendable making the platform scalable for incorporation of more ligatures.…”
Section: Introduction (mentioning)
confidence: 99%
“…1), are employed as recognition units in the proposed technique. A semi-automatic algorithm is selected for clustering multiple occurrences of high-frequency ligatures (HFLs) extracted from the well-known Urdu Printed Text Images (UPTI) dataset [5]. Hidden Markov Models (HMMs) are employed for recognition, a separate HMM is trained for each ligature cluster.…”
Section: Introduction (mentioning)
confidence: 99%