Abstract. Visual saliency models have been introduced to the field of character recognition for detecting characters in natural scenes. Researchers believe that characters have different visual properties from their non-character neighbors, which make them salient. Under this assumption, characters should respond well to computational models of visual saliency. However, in some situations, characters belonging to scene text might not be as salient as one might expect. For instance, a signboard is usually very salient, but the characters on the signboard might not be salient globally. To analyze this hypothesis in more depth, we first examine how much background regions such as signboards affect the task of saliency-based character detection in natural scenes. We then propose a hierarchical-saliency method for detecting characters in natural scenes. Experiments on a dataset of over 3,000 images containing scene text show that, when using saliency alone for scene text detection, the proposed hierarchical method captures a larger percentage of text pixels than the conventional single-pass algorithm.
This paper evaluates the degree of saliency of texts in natural scenes using visual saliency models. A large-scale scene image database with pixel-level ground truth is created for this purpose. Using this database and five state-of-the-art models, visual saliency maps that represent the degree of saliency of the objects are calculated. The receiver operating characteristic (ROC) curve is employed to evaluate the saliency of scene texts as calculated by the visual saliency models. A visualization is given of the distribution of scene texts and non-texts in the space constructed by three kinds of saliency maps, which are calculated using Itti's visual saliency model with intensity, color and orientation features. This visualization indicates that text characters are more salient than their non-text neighbors and can be separated from the background. Therefore, scene texts can be extracted from the scene images. With this in mind, a new visual saliency architecture, named the hierarchical visual saliency model, is proposed. The hierarchical visual saliency model is based on Itti's model and consists of two stages. In the first stage, Itti's model is used to calculate the saliency map, and Otsu's global thresholding algorithm is applied to extract the salient region of interest. In the second stage, Itti's model is applied to the salient region to calculate the final saliency map. An experimental evaluation demonstrates that the proposed model outperforms Itti's model in terms of captured scene texts.
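The two-stage procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_saliency` is a hypothetical single-channel center-surround contrast used as a stand-in for Itti's model (which combines intensity, color and orientation features across scales), and restricting the second pass to the bounding box of the thresholded region is an assumption, since the abstract does not say how the salient region is passed to the second stage. All function names and parameters are illustrative.

```python
import numpy as np


def box_blur(img, radius):
    """Mean filter over a (2*radius+1) square window, via an integral image."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    ii = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    total = ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]
    return total / (k * k)


def toy_saliency(gray, center=1, surround=4):
    """Stand-in for Itti's model (assumption): center-surround contrast in [0, 1]."""
    contrast = np.abs(box_blur(gray, center) - box_blur(gray, surround))
    span = contrast.max() - contrast.min()
    if span == 0:
        return np.zeros_like(contrast)
    return (contrast - contrast.min()) / span


def otsu_threshold(values, bins=256):
    """Otsu's global threshold for values in [0, 1]: maximize between-class variance."""
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    centers = (edges[:-1] + edges[1:]) / 2
    n0 = np.cumsum(hist)                # pixel count at or below each bin
    n1 = hist.sum() - n0                # pixel count above each bin
    s0 = np.cumsum(hist * centers)      # cumulative intensity sum
    valid = (n0 > 0) & (n1 > 0)         # both classes must be non-empty
    between = np.zeros(bins)
    mean0 = s0[valid] / n0[valid]
    mean1 = (s0[-1] - s0[valid]) / n1[valid]
    between[valid] = n0[valid] * n1[valid] * (mean0 - mean1) ** 2
    return edges[1:][np.argmax(between)]


def hierarchical_saliency(gray):
    """Stage 1: saliency map + Otsu mask. Stage 2: saliency inside the salient region."""
    first = toy_saliency(gray)
    mask = first > otsu_threshold(first)
    second = np.zeros_like(first)
    if mask.any():
        rows, cols = np.where(mask)
        r0, r1 = rows.min(), rows.max() + 1
        c0, c1 = cols.min(), cols.max() + 1
        # Assumption: the "salient region" is the bounding box of the mask.
        second[r0:r1, c0:c1] = toy_saliency(gray[r0:r1, c0:c1])
    return first, mask, second


# Synthetic scene: dark background, a mid-gray "signboard", bright "strokes" on it.
scene = np.zeros((64, 64))
scene[16:48, 16:48] = 0.5
scene[28:36, 24:40:4] = 1.0
first, mask, second = hierarchical_saliency(scene)
```

The second pass renormalizes contrast within the cropped region only, which is the intuition behind the hierarchy: content that is weak relative to the whole scene can still stand out relative to its immediate surroundings. Whether this toy stand-in reproduces the reported gains is not something the sketch claims.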