Abstract. A major requirement in the design of robust OCRs is that the feature extraction scheme be invariant to the popular fonts used in print. Many statistical and structural features have been tried for character classification in the past. In this paper, motivated by recent successes in the object category recognition literature, we use a spatial extension of the histogram of oriented gradients (HOG) for character classification. Our experiments are conducted on 1,453,950 Telugu character samples in 359 classes and 15 fonts. On this data set, we obtain an accuracy of 96-98% with an SVM classifier.
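As a rough illustration of the kind of feature described in the abstract above, the following sketch computes orientation histograms over progressively finer grids and concatenates them. The abstract does not specify the exact spatial extension, so the pyramid layout (1x1, 2x2, 4x4 cells), the 9 unsigned orientation bins, the L2 normalization, and the function name `spatial_hog` are all assumptions made for illustration, not the authors' configuration.

```python
import math

def spatial_hog(image, levels=(1, 2, 4), n_bins=9):
    """Concatenate gradient-orientation histograms over g x g grids.

    `image` is a 2D list of gray values. The grid levels and bin count
    are illustrative assumptions, not values taken from the paper.
    """
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    obin = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Finite differences, with neighbors clamped at the border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            # Unsigned orientation folded into [0, pi).
            ang = math.atan2(gy, gx) % math.pi
            obin[y][x] = min(int(ang / math.pi * n_bins), n_bins - 1)

    feats = []
    for g in levels:
        for i in range(g):
            for j in range(g):
                # Magnitude-weighted histogram for one grid cell.
                hist = [0.0] * n_bins
                for y in range(i * h // g, (i + 1) * h // g):
                    for x in range(j * w // g, (j + 1) * w // g):
                        hist[obin[y][x]] += mag[y][x]
                feats.extend(hist)

    # L2-normalize the whole descriptor (left as zeros for a flat image).
    norm = math.sqrt(sum(v * v for v in feats))
    return [v / norm for v in feats] if norm > 0 else feats
```

With the default settings the descriptor has (1 + 4 + 16) x 9 = 189 dimensions. A vector like this could then be passed to any standard SVM implementation for classification.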
This paper presents an XML-based scheme for managing a large multilingual OCR project. In particular, we describe how a new XML-based tagging scheme has been exploited to achieve the objectives of the project. Managing a large multilingual OCR project in which multiple research groups collaboratively develop script-specific and script-independent technologies is a challenging problem. In this paper, we present some of the software and data management strategies designed for the project, which aimed at developing OCR for 11 scripts of Indian origin for which mature OCR technology was not available.
Though Indian language OCRs have shown significant improvement in classification rates in recent years, recognition of degraded words still poses a big challenge for the development of robust OCR systems. We attempt to formulate the problem of degraded word recognition in a generic and formal structure, casting it as a probabilistic parsing problem. A probabilistic parsing based framework is used to rank and validate various possible hypotheses. We combine it with an alternate word generator, a symbol recognizer, and a verification unit to improve recognition rates of degraded words without compromising good characters. We demonstrate our method on Malayalam. We evaluate it on a complete annotated book, where around 65% of the degraded words are correctly recognized using this approach.