2017
DOI: 10.1186/s13640-017-0208-z

Segmentation-free optical character recognition for printed Urdu text

Abstract: This paper presents a segmentation-free optical character recognition system for printed Urdu Nastaliq font using ligatures as units of recognition. The proposed technique relies on statistical features and employs Hidden Markov Models for classification. A total of 1525 unique high-frequency Urdu ligatures from the standard Urdu Printed Text Images (UPTI) database are considered in our study. Ligatures extracted from text lines are first split into primary (main body) and secondary (dots and diacritics) ligat…

Citations: cited by 36 publications (11 citation statements)
References: 40 publications (92 reference statements)
“…Their proposed system is trained on the UPTI dataset using Multidimensional-LSTM Recurrent Network that has attained 98% of accuracy on Nastaleeq Urdu Font. To recognize Urdu text, Israr Ud Din et al [2] presented a holistic approach for the recognition of printed Urdu text in Nastaleeq font. They have extracted 9 different statistical features with cumulative dimensionality of 116 for each sub-word image using a sliding window from left to right.…”
Section: Literature Review | Type: mentioning
confidence: 99%
“…There are a limited number of benchmark datasets for Perso-Arabic scripts. Some of them have been presented here: UPTI: Urdu Printed Text Image dataset, used by [19], [20], [24], [26] [30]. Although the first dataset of its kind till that time, this handwritten offline dataset has a limited number of data samples, 44 individual characters, and 57 Urdu words, focusing on one field mostly, the financial terms.…”
Section: Perso-Arabic Datasets | Type: mentioning
confidence: 99%
“…Due to challenges already discussed, implicit segmentation‐based techniques have remained a popular choice of researchers [72–75]. Likewise, in the case of holistic approaches, ligatures have been typically employed as recognition units [76].…”
Section: Related Work | Type: mentioning
confidence: 99%
“…These techniques use the sliding windows to extract features from ligature images which are projected in the quantised feature space hence representing each ligature image as a sequence. In some cases, the main body and dots are separately recognised [76] to reduce the total number of unique classes which can be very high in case of Urdu text (Urdu has more than 26,000 unique ligatures [84]). A number of holistic techniques are based on word spotting [85, 86] rather than recognition, to retrieve documents containing words similar to those provided as a query.…”
Section: Related Work | Type: mentioning
confidence: 99%
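
The citation statements above describe sliding a window across a ligature image, computing statistical features per window, and projecting them into a quantised feature space so that each image becomes a sequence of symbols. The following is a minimal sketch of that general idea only, not the paper's method: the two features (ink density and vertical centre of gravity), the window size, and the hand-written codebook are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def sliding_window_features(image, width=4, step=2):
    """Slide a window left to right over a binary image (1 = ink).

    Returns an array of shape (n_windows, 2) holding, per window,
    the ink density and the normalised vertical centre of gravity.
    These two features are illustrative assumptions, not the nine
    features used in the cited paper.
    """
    image = np.asarray(image).astype(int)
    height, total_width = image.shape
    rows = np.arange(height)
    features = []
    for start in range(0, max(total_width - width, 0) + 1, step):
        window = image[:, start:start + width]
        ink = window.sum()
        density = ink / window.size
        if ink > 0:
            # Row index weighted by ink per row, scaled to [0, 1].
            centre = (window.sum(axis=1) * rows).sum() / (ink * max(height - 1, 1))
        else:
            centre = 0.5  # neutral value for an empty window
        features.append((density, centre))
    return np.array(features)


def quantise(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    distances = np.linalg.norm(
        features[:, None, :] - codebook[None, :, :], axis=2
    )
    return distances.argmin(axis=1)


# Toy example: an 8x10 image with a 4x4 block of ink.
img = np.zeros((8, 10), dtype=int)
img[2:6, 3:7] = 1
feats = sliding_window_features(img, width=4, step=2)
# Hypothetical two-entry codebook; in practice it would be learned
# (for example by clustering training features).
codebook = np.array([[0.0, 0.5], [0.5, 0.5]])
symbols = quantise(feats, codebook)
```

The resulting `symbols` array is the kind of discrete observation sequence that a sequence classifier such as a Hidden Markov Model could consume, with one model per ligature class.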