There are two main techniques for converting written or printed text into digital format. The first is to create an image of the written or printed text, but images are large and require substantial storage space, and text in image form cannot undergo further processing such as editing, searching, or copying. The second is to use an Optical Character Recognition (OCR) system. OCR systems can read handwritten or printed documents and convert them into digital text, which can then be processed to extract knowledge. A huge amount of Urdu-language data is available in handwritten or printed form and needs to be converted into digital format for knowledge acquisition. The highly cursive, structurally complex, bidirectional, and compound nature of Urdu script makes it difficult to obtain accurate OCR results. In this study, a supervised learning-based OCR system is proposed for Nastalique Urdu. Evaluated under a variety of experimental settings, the proposed system achieves recognition rates of 98.4% on training data and 97.3% on test data, which is the highest recognition rate achieved so far by any Urdu-language OCR system. The proposed system is simple to implement, especially on the software side of an OCR system. The technique is useful for printed as well as handwritten text, and it will help in developing more accurate Urdu OCR software systems in the future.
Optical character recognition (OCR) systems convert printed or handwritten scripts into digital text formats such as ASCII or Unicode. Languages written in Urdu-like scripts, such as Urdu, Punjabi, and Sindhi, are widely spoken, especially in Asia. An enormous amount of printed and handwritten text in these languages exists and needs to be converted into computer-understandable formats for knowledge extraction. In this study, an OCR system based on the deep extreme learning machine (DELM), a recently proposed variant of the extreme learning machine (ELM), is proposed to improve the character recognition rate for Urdu-like script languages. The proposed DELM-based character recognition model optimizes the OCR process by reducing the overhead of the pre-processing, segmentation, and feature extraction layers. With six DELM hidden layers, the proposed system achieved 98.75% training accuracy with an RMSE of 1.492 × 10⁻³ and 98.12% testing accuracy with an RMSE of 1.587 × 10⁻³. The results show that the proposed system attains the highest recognition rate compared with any previously proposed OCR system for Urdu-like script languages. The technique is applicable to machine-printed text and partially useful for handwritten text as well. This study will aid the development of more accurate OCR software systems for Urdu-like scripts in the future.
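The abstract above reports two kinds of figures, accuracy and RMSE, for the training and test sets. The following is a minimal sketch of how such metrics are commonly computed for a character classifier. It is not the authors' evaluation code: the abstract does not say how RMSE was calculated, so the sketch assumes one-hot target vectors compared against the classifier's output scores, and the function names and toy data are hypothetical.

```python
import math


def accuracy(predicted_labels, true_labels):
    """Fraction of samples whose predicted class equals the true class."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


def rmse(outputs, targets):
    """Root mean square error over every output unit of every sample.

    Assumption: targets are one-hot rows and outputs are score rows
    of the same shape (the source does not specify this).
    """
    squared_errors = [
        (o - t) ** 2
        for out_row, tgt_row in zip(outputs, targets)
        for o, t in zip(out_row, tgt_row)
    ]
    if not squared_errors:
        raise ValueError("at least one output value is required")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def argmax(row):
    """Index of the largest score in a row, i.e. the predicted class."""
    return max(range(len(row)), key=row.__getitem__)


# Toy data (hypothetical): four samples, three character classes.
targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
outputs = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],  # misclassified: true class is 2, highest score is class 1
    [0.8, 0.1, 0.1],
]

predicted = [argmax(row) for row in outputs]
actual = [argmax(row) for row in targets]

acc = accuracy(predicted, actual)
err = rmse(outputs, targets)
print(f"accuracy = {acc:.2%}, RMSE = {err:.4f}")
```

In this toy run three of the four samples are classified correctly. The two metrics answer different questions: accuracy counts only whether the top-scoring class is right, while RMSE also reflects how far the raw output scores are from the ideal one-hot targets.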