An Adaptive Thresholding Algorithm-Based Optical Character Recognition System for Information Extraction in Complex Images

Akinbade, Daniel; Ogunde, Adewale Opeoluwa; Odim, Mba Obasi; Oguntunde, Bosede Oyenike

doi:10.3844/jcssp.2020.784.801

Cited by 11 publications

(7 citation statements)

References 18 publications

“…Haraj and Raissouni produced an average of 95.77% character accuracy using tesseract and opencv library over 4 sample images in 2015 [17]. Those research [14,15,16,17] only used relatively small samples (less than 50 documents), while our study used more documents (8,562 documents in 6 Categories and two document structures). Previous research [14,15,16,17], which also employed the Tesseract library, only used string matching to measure the OCR.…”

Section: Related Work (mentioning)

confidence: 83%

“…Similar research employed the Tesseract library [16] with only 11 images as input yielding 69.7% precision. On the other hand, our study produced 83.07% precision with 8,562 documents as the same library input.…”

Section: B. Discussion (mentioning)

confidence: 99%

“…Kumar and friends produced 97% accuracy for small scanned bill documents and 83% accuracy for small scanned bill documents using Tesseract OCR on 25 scanned bills in 2020 [15]. Akinbade and friends produced 81.9% character accuracy and 69.7% word accuracy on 11 sample images in 2020 [16]. Haraj and Raissouni produced an average of 95.77% character accuracy using tesseract and opencv library over 4 sample images in 2015 [17].…”

Section: Related Work (mentioning)

confidence: 99%

“…Those research [14,15,16,17] only used relatively small samples (less than 50 documents), while our study used more documents (8,562 documents in 6 Categories and two document structures). Previous research [14,15,16,17], which also employed the Tesseract library, only used string matching to measure the OCR. On the other hand, our study used four measurements, i.e., conversion time, NER time, string match accuracy as precision, and the number of entities acquired as recall.…”

Section: Related Work (mentioning)

confidence: 99%

“…An offline desktop-based application called Foxit, an online-based application called PDF2GO, and an open-source OCR library called Tesseract were used to convert all documents. Patel et al in 2012 [14] produce 70% accuracy, Kumar et al in 2020 [15] produce 97% accuracy and, Akinbade [16] produce 81.9% accuracy using the Tesseract library on scanned documents.…”

Section: B. OCR Engines Preprocessing (mentioning)

confidence: 99%

Optical Character Recognition Engines Performance Comparison in Information Extraction

Ramdhani, Budi, Purwandari. 2021. IJACSA.

Named Entity Recognition (NER) is often used to acquire important information from text documents as part of the Information Extraction (IE) process. However, the quality of the text documents affects the accuracy of the data obtained, especially for documents produced through Optical Character Recognition (OCR), which never reaches 100% accuracy. This research examined which OCR engine performs best for IE using NER by comparing three OCR engines (Foxit, PDF2GO, Tesseract) over 8,562 government human resources documents across six document categories, two document structures, and four measurements. Several essential entities, such as name, employee ID, document number, document publishing date, employee rank, and family member's name, were targeted for automatic extraction from the documents. NER processing was done in the Python programming language, and the preprocessing tasks were done separately for Foxit, PDF2GO, and Tesseract. In summary, each OCR engine has its drawbacks and benefits; for example, Tesseract has better NER extraction and conversion time with better accuracy, but acquires fewer entities.
