What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels

Baek, Jeonghun; Matsui, Yusuke; Aizawa, Kiyoharu

doi:10.1109/cvpr46437.2021.00313

Cited by 84 publications

(35 citation statements)

References 50 publications

“…The existing methods [3,4] do not recognize the characters correctly, while the proposed method reports correct recognition results. In addition, we have calculated the average FPS of the proposed and existing methods [3,4] for all the 8 datasets and the results are 6.76, 6.68 and 7.58 for the methods [3], [4] and SGBANet, respectively. This shows that our method is faster than the existing methods.…”

Section: Comparison With State-of-the-art Approaches (mentioning)

confidence: 86%

SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition

Zhong, Lyu, Shivakumara et al. 2022

Preprint

Scene text recognition is a challenging task due to the complex backgrounds and diverse variations of text instances. In this paper, we propose a novel Semantic GAN and Balanced Attention Network (SGBANet) to recognize the texts in scene images. The proposed method first generates the simple semantic feature using Semantic GAN and then recognizes the scene text with the Balanced Attention Module. The Semantic GAN aims to align the semantic feature distribution between the support domain and target domain. Different from the conventional image-to-image translation methods that perform at the image level, the Semantic GAN performs the generation and discrimination on the semantic level with the Semantic Generator Module (SGM) and Semantic Discriminator Module (SDM). For target images (scene text images), the Semantic Generator Module generates simple semantic features that share the same feature distribution with support images (clear text images). The Semantic Discriminator Module is used to distinguish the semantic features between the support domain and target domain. In addition, a Balanced Attention Module is designed to alleviate the problem of attention drift. The Balanced Attention Module first learns a balancing parameter based on the visual glimpse vector and semantic glimpse vector, and then performs the balancing operation for obtaining a balanced glimpse vector. Experiments on six benchmarks, including regular datasets, i.e., IIIT5K, SVT, ICDAR2013, and irregular datasets, i.e., ICDAR2015, SVTP, CUTE80, validate the effectiveness of our proposed method.

“…It can be observed from Fig. 1 that the existing methods [4,11] do not recognize the characters correctly for arbitrarily shaped text and text with complex backgrounds. Therefore, designing a robust method for recognizing arbitrarily shaped text is still a challenging task that remains to be solved.…”

Section: Introduction (mentioning)

confidence: 97%

SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition

Zhong, Lyu, Shivakumara et al. 2022

Preprint

“…As the thriving of deep learning, the researchers also made attempts to build the text recognition models based on deep neural networks following the bottom-up fashion [2,3,5,6,22,23,29,35,38,40,41,42,43,46,47,51,59,60,65,67,71,73,83,87,90,98,100]. For example, CRNN [59] utilizes the CNN-RNN architecture to extract features for the text images, which are further supervised with the CTC loss [24] to maximize the probability of the ground truth.…”

Section: Existing Text Recognition Methods (mentioning)

confidence: 99%

Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study

Chen, Yu, Ma et al. 2021

Preprint

The flourishing blossom of deep learning has witnessed the rapid development of text recognition in recent years. However, the existing text recognition methods are mainly for English texts, whereas ignoring the pivotal role of Chinese texts. As another widely-spoken language, Chinese text recognition in all ways has extensive application markets. Based on our observations, we attribute the scarce attention on Chinese text recognition to the lack of reasonable dataset construction standards, unified evaluation methods, and results of the existing baselines. To fill this gap, we manually collect Chinese text datasets from publicly available competitions, projects, and papers, then divide them into four categories including scene, web, document, and handwriting datasets. Furthermore, we evaluate a series of representative text recognition methods on these datasets with unified evaluation methods to provide experimental results. By analyzing the experimental results, we surprisingly observe that state-of-the-art baselines for recognizing English texts cannot perform well on Chinese scenarios. We consider that there still remain numerous challenges under exploration due to the characteristics of Chinese texts, which are quite different from English texts. The code and datasets are made publicly available at https://github.com/FudanVI/benchmarking-chinese-text-recognition.

Figure 1. Three reasons for the scarce attention of Chinese text recognition. (a) People may use different ways to crop text regions, which leads to unfair comparison. (b) It is necessary to specify the equivalence between lowercase and uppercase, half-width and full-width, simplified and traditional characters. (c) The existing methods are mainly evaluated with English datasets rather than Chinese datasets.

“…In this work, we adopt STR public datasets to evaluate the performance of the pre-trained model. The datasets cover … [A New Dataset: UTI-100M] As suggested in literature (Baek, Matsui, and Aizawa 2021), training model on real data can yield better results than synthetic data. Therefore, we collect a large-scale real dataset containing about 100 million unlabeled text line images, named Unlabeled Text Image 100M (UTI-100M), to explore the potential of the proposed hierarchical contrastive learning paradigm.…”

Section: Datasets and Metrics (mentioning)

confidence: 99%

Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition

Líu, Wang, Bao et al. 2022

AAAI

We introduce Perceiving Stroke-Semantic Context (PerSec), a new approach to self-supervised representation learning tailored for Scene Text Recognition (STR) task. Considering scene text images carry both visual and semantic properties, we equip our PerSec with dual context perceivers which can contrast and learn latent representations from low-level stroke and high-level semantic contextual spaces simultaneously via hierarchical contrastive learning on unlabeled text image data. Experiments in un- and semi-supervised learning settings on STR benchmarks demonstrate our proposed framework can yield a more robust representation for both CTC-based and attention-based decoders than other contrastive learning methods. To fully investigate the potential of our method, we also collect a dataset of 100 million unlabeled text images, named UTI-100M, covering 5 scenes and 4 languages. By leveraging hundred-million-level unlabeled data, our PerSec shows significant performance improvement when fine-tuning the learned representation on the labeled data. Furthermore, we observe that the representation learned by PerSec presents great generalization, especially under few labeled data scenes.
