A shape based post processor for Gurmukhi OCR

Lehal, Gurpreet Singh; Singh, Chandan; Lehal, Ritu

doi:10.1109/icdar.2001.953957

Cited by 26 publications

(13 citation statements)

References 10 publications


“…The script is similar to Devnagari but simpler since compound characters are absent there. Although research on Devnagari OCR started 20 years ago, that on Gurumukhi script started only recently [79][80][81][82][83][84][119]. Lehal and Singh [79] developed a complete OCR system for printed Gurumukhi script where connected components are first segmented using a thinning based approach.…”

Section: Studies On Gurumukhi Character Recognition (mentioning)

confidence: 99%

Indian script character recognition: a survey

Pal

Chaudhuri

2004

Pattern Recognition

461

178


“…The most popular approach to handle poor recognition on degraded documents, was to use strong post-processing modules such as character error models [3], dictionaries [4], statistical language models [5], or a combination [6]. However, post-processing modules are not easy to construct for Indian languages due to large vocabulary size [7].…”

Section: Introduction (mentioning)

confidence: 99%

Robust Recognition of Degraded Documents Using Character N-Grams

Dutta

Sankaran

Sankar

et al. 2012

2012 10th IAPR International Workshop on Document Analysis Systems


Abstract: In this paper we present a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images. OCRs have considerably good performance on good quality documents, but fail easily in presence of degradations. Also, classical OCR approaches perform poorly over complex scripts such as those for Indian languages. We address these issues by proposing to recognize character n-gram images, which are basically groupings of consecutive character/component segments. Our approach is unique, since we use the character n-grams as a primitive for recognition rather than for postprocessing. By exploiting the additional context present in the character n-gram images, we enable better disambiguation between confusing characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Our method is inherently robust to degradations such as cuts and merges which are common in digital libraries of scanned documents. We also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition in the case of heavily degraded documents.


“…Along these lines, the flat projection of an archive picture is the most regularly utilized strategy to concentrate the lines from the report [13,14,15,16]. Provided that the lines are decently differentiated and not tilted, the flat projection will have overall divided tops and valleys [77].…”

Section: Line Wise Script Segmentation (mentioning)

confidence: 99%
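
The horizontal-projection idea in the statement above can be illustrated with a minimal sketch. It is not taken from the cited papers: it assumes a binarized, non-skewed page given as a list of rows (1 = ink, 0 = background), and the names `horizontal_projection`, `segment_lines`, and `min_ink` are hypothetical. Rows whose ink count falls below the threshold are treated as the valleys that separate text lines.

```python
def horizontal_projection(image):
    """Count ink pixels in each row of a binary image (1 = ink, 0 = background)."""
    return [sum(row) for row in image]


def segment_lines(image, min_ink=1):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    A text line is a maximal run of consecutive rows whose projection
    value is at least `min_ink` (an assumed, tunable threshold).
    """
    profile = horizontal_projection(image)
    lines = []
    start = None  # row index where the current text line began, if any
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                  # entering a peak
        elif ink < min_ink and start is not None:
            lines.append((start, y))   # leaving a peak: a valley begins at row y
            start = None
    if start is not None:              # the last line touches the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy page: two text lines separated by blank rows.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]
print(segment_lines(page))
```

On a tilted or touching-line page the valleys fill in and this simple thresholding no longer separates lines, which is the limitation the quoted statement points to.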