Previous work on optical font recognition has had limited applicability because it handles only a few types of font attributes and estimates them from a line or block of text. This paper proposes a word-level optical font recognition system for printed Korean and English documents. Working at the word level has the advantage of yielding more detailed font attributes, including script (Korean and English), font style (regular, bold, italic, and underlined), typeface (Myung-jo and Gothic), point size (10, 12, and 14 pt), and word length (2 to 5 for Korean, and 4 to 10 for English). A hierarchical classifier and several typographical features have been devised for the system, and their effectiveness is demonstrated in an experiment with a database of 100 sets of 264 font categories.
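The figure of 264 font categories is consistent with the attribute values listed above if every combination of style, typeface, point size, and word length is counted separately for each script. The following Python sketch enumerates that cross product. It rests on an assumption not stated in the abstract: that both typefaces apply to both scripts and that no combination is excluded. The constant and function names are hypothetical.

```python
from itertools import product

# Attribute values as listed in the abstract.
STYLES = ["regular", "bold", "italic", "underlined"]
TYPEFACES = ["Myung-jo", "Gothic"]
POINT_SIZES = [10, 12, 14]
# Word lengths differ per script: 2 to 5 for Korean, 4 to 10 for English.
WORD_LENGTHS = {
    "Korean": [2, 3, 4, 5],
    "English": list(range(4, 11)),
}


def enumerate_categories():
    """Return every (script, style, typeface, size, length) combination.

    Assumption: the categories are the full cross product within each script.
    """
    categories = []
    for script, lengths in WORD_LENGTHS.items():
        for style, typeface, size, length in product(
            STYLES, TYPEFACES, POINT_SIZES, lengths
        ):
            categories.append((script, style, typeface, size, length))
    return categories


categories = enumerate_categories()
print(len(categories))  # 96 Korean + 168 English = 264
```

Under this assumption, Korean contributes 4 × 2 × 3 × 4 = 96 categories and English contributes 4 × 2 × 3 × 7 = 168, which sum to the 264 reported.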
Abstract. Historical documents are valuable cultural heritage and sources for studying the history, society, and daily life of their time. Digitizing historical documents aims to give instant access to the archives to researchers and the public, who have had limited access for maintenance reasons. However, most of these documents are not only handwritten in ancient Chinese characters but also have complex page layouts. As a result, conventional OCR (optical character recognition) systems are difficult to apply to historical documents, even though OCR has received the most attention for several years as a key module in digitization. We have been developing an OCR-based digitization system for historical documents for years. In this paper, we propose dedicated segmentation and rejection methods for OCR of Korean historical documents. The proposed recognition-based segmentation method uses geometric features and context information with the Viterbi algorithm. The rejection method uses Mahalanobis distance and posterior probability, in particular to address the out-of-class problem. Some promising experimental results are reported.
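As a rough illustration of how a rejection rule based on Mahalanobis distance and posterior probability might work, the Python sketch below rejects a sample when it lies too far from the best-matching class (a possible out-of-class input) or when the best posterior is too low (an ambiguous input). This is not the authors' method. The Gaussian class model with diagonal covariance, the thresholds `max_dist` and `min_post`, and all function names are assumptions made for illustration only.

```python
import math


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under an assumed diagonal covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def posteriors(x, classes):
    """Posterior probability of each class for feature vector x.

    classes maps label -> (mean, var, prior); each class is modeled as a
    Gaussian with diagonal covariance (an assumption of this sketch).
    """
    log_scores = {}
    for label, (mean, var, prior) in classes.items():
        d2 = mahalanobis_diag(x, mean, var) ** 2
        log_det = sum(math.log(v) for v in var)
        log_scores[label] = math.log(prior) - 0.5 * (d2 + log_det)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def classify_or_reject(x, classes, max_dist=3.0, min_post=0.9):
    """Return the best label, or None if the sample is rejected.

    max_dist and min_post are hypothetical thresholds.
    """
    post = posteriors(x, classes)
    best = max(post, key=post.get)
    mean, var, _ = classes[best]
    if mahalanobis_diag(x, mean, var) > max_dist:
        return None  # too far from every known class: likely out-of-class
    if post[best] < min_post:
        return None  # known region, but ambiguous between classes
    return best


# Toy two-class model with two-dimensional features.
toy_classes = {
    "A": ([0.0, 0.0], [1.0, 1.0], 0.5),
    "B": ([5.0, 5.0], [1.0, 1.0], 0.5),
}
print(classify_or_reject([0.1, -0.2], toy_classes))
print(classify_or_reject([20.0, 20.0], toy_classes))
```

The two tests play different roles: the posterior alone cannot detect an out-of-class sample, because it is normalized over the known classes and can be close to 1 even for a point far from all of them, whereas the distance test catches exactly that case.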