Primitive Representation Learning for Scene Text Recognition

Yan, Ruijie; Peng, Liangrui; Xiao, Shanyu; Yao, Gang

doi:10.1109/cvpr46437.2021.00035

“…Comparison with State-of-the-Art: We compare the proposed S-GTR with state-of-the-art methods, and the results are summarized in Table 1, where the inference speed as well as the number of model parameters are also reported. As can be seen, the proposed S-GTR achieves the highest recognition accuracy and 3× faster inference speed compared with the second best method PREN2D (Yan et al 2021). In addition, when real data is utilized for training, S-GTR achieves more impressive results on all the six benchmarks, validating the effectiveness of the proposed GTR for textual reasoning and the benefit of real data.…”

Section: Performance Analysis (mentioning)

confidence: 59%

“…To further verify the effectiveness of GTR, we plug our GTR module into four representative types of STR methods, including CTC-based method (e.g., CRNN (Shi, Bai, and Yao 2016)), 1D attention-based method (e.g., TRBA (Baek et al 2019)), 2D attention-based method (e.g., Base2D (Yan et al 2021)), and transformer-based methods (e.g., SRN (Yu et al 2020) and ABINet-LV (Fang et al 2021)). For the 1D attention-based method, the prediction result of VR is a 1D semantic vector.…”

Section: Plugging GTR in Different Models (mentioning)

confidence: 99%

Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition

He, Chen, Zhang et al. 2021

Preprint


Existing Scene Text Recognition (STR) methods typically use a language model to optimize the joint probability of the 1D character sequence predicted by a visual recognition (VR) model, which ignore the 2D spatial context of visual semantics within and between character instances, making them not generalize well to arbitrary shape scene text. To address this issue, we make the first attempt to perform textual reasoning based on visual semantics in this paper. Technically, given the character segmentation maps predicted by a VR model, we construct a subgraph for each instance, where nodes represent the pixels in it and edges are added between nodes based on their spatial similarity. Then, these subgraphs are sequentially connected by their root nodes and merged into a complete graph. Based on this graph, we devise a graph convolutional network for textual reasoning (GTR) by supervising it with a cross-entropy loss. GTR can be easily plugged in representative STR models to improve their performance owing to better textual reasoning. Specifically, we construct our model, namely S-GTR, by paralleling GTR to the language model in a segmentation-based STR baseline, which can effectively exploit the visual-linguistic complementarity via mutual learning. S-GTR sets new state-of-the-art on six challenging STR benchmarks and generalizes well to multi-linguistic datasets.
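
The graph construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_character_graph`, the distance threshold `radius` used as a stand-in for "spatial similarity", and the choice of the pixel nearest the centroid as each subgraph's root node are all assumptions made for this example.

```python
import numpy as np

def build_character_graph(instance_masks, radius=1.5):
    """Build one graph from per-character segmentation masks.

    instance_masks: list of 2D boolean arrays, one per character,
                    given in reading order (assumed, hypothetical input).
    radius:         assumed distance threshold standing in for the
                    abstract's "spatial similarity" criterion.
    Returns (coords, adjacency, roots).
    """
    coords_list, roots, offset = [], [], 0
    for mask in instance_masks:
        pts = np.argwhere(mask)  # (n_i, 2) pixel coordinates as (row, col)
        if len(pts) == 0:
            raise ValueError("each instance mask must contain at least one pixel")
        # Assumption: the root is the pixel closest to the instance centroid.
        centroid = pts.mean(axis=0)
        root_local = int(np.argmin(np.linalg.norm(pts - centroid, axis=1)))
        roots.append(offset + root_local)
        coords_list.append(pts)
        offset += len(pts)

    coords = np.concatenate(coords_list).astype(float)
    # Instance id of every node, used to keep threshold edges inside a subgraph.
    inst = np.concatenate([np.full(len(p), i) for i, p in enumerate(coords_list)])

    # Pairwise Euclidean distances between all pixels.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    same_instance = inst[:, None] == inst[None, :]
    adjacency = (same_instance & (dist <= radius) & (dist > 0)).astype(np.float32)

    # Sequentially connect consecutive subgraphs through their root nodes.
    for a, b in zip(roots[:-1], roots[1:]):
        adjacency[a, b] = adjacency[b, a] = 1.0
    return coords, adjacency, roots
```

The dense pairwise distance matrix keeps the sketch short but grows quadratically with the number of pixels, so a real system would likely use a sparse neighbourhood search instead.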


“…Specifically, our model achieves superior performance improvements on SVT, IC13_L, IC15_S and SVTP (datasets contain low-quality images) by 1.1%∼1.7%. PREN2D [29] slightly wins on CUTE, but IterNet shows huge performance gains on all the other datasets: 1.2% on IIIT, 1.1% on SVT, 1.5% on IC13_S, 4.7% on IC15_S, and 3.3% on SVTP. It's worth noting that our IterNet uses the same iterative language modeling module as ABINet, but with a different vision modeling module (i.e., IterVM).…”

Section: Comparison to State-of-the-Arts (mentioning)

confidence: 98%

“…VisionLAN [28] proposes language-aware visual masks for training, which simulates the case of missing character-wise visual semantics and guides the vision modeling module to use not only the visual texture of characters but also the linguistic information in visual context for recognition. PREN2D [29] proposes global feature aggregations to learn primitive visual representations from multi-scale feature maps and exploits GCNs to transform primitive representations into high-level visual text representations. Different from these works, our IterVM uses feedback connections to fuse high-level (the most semantic) visual feature with multi-level visual features.…”

Section: Visual Feature Enhancement by Semantic Information (mentioning)

confidence: 99%
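
As a rough illustration of how a GCN can transform one set of node representations into another, as the excerpt above says PREN2D does with primitive representations, the following sketch implements a single graph-convolution step using the widely used symmetric normalization H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W). Whether PREN2D uses exactly this formulation is an assumption; the function name `gcn_layer` and its arguments are hypothetical.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step with symmetric normalization.

    adj:    (n, n) adjacency matrix without self-loops.
    feats:  (n, d_in) node features (e.g. primitive representations).
    weight: (d_in, d_out) learnable projection (passed in here for simplicity).
    """
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    deg = a_hat.sum(axis=1)                  # node degrees, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    norm_adj = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ feats @ weight, 0.0)  # ReLU activation
```

For example, with two connected nodes, identity features, and an identity weight matrix, every output entry equals 0.5, since each node averages itself and its neighbour under the normalization.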

IterVM: Iterative Vision Modeling Module for Scene Text Recognition

Chu, Wang 2022

Preprint


“…Specifically, it models the sliced visual features as the graph nodes, captures their dependency, and merges features of the same instance for prediction. PREN2D (Yan et al 2021) adopts a meta-learning framework to extract visual representations via GCN. In this paper, we devise a two-level graph network based on GCN to perform spatial context reasoning within and between character instances to refine the visual recognition results.…”

Section: Related Work (mentioning)

confidence: 99%

Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition

He, Chen, Zhang et al. 2022

AAAI


Existing Scene Text Recognition (STR) methods typically use a language model to optimize the joint probability of the 1D character sequence predicted by a visual recognition (VR) model, which ignore the 2D spatial context of visual semantics within and between character instances, making them not generalize well to arbitrary shape scene text. To address this issue, we make the first attempt to perform textual reasoning based on visual semantics in this paper. Technically, given the character segmentation maps predicted by a VR model, we construct a subgraph for each instance, where nodes represent the pixels in it and edges are added between nodes based on their spatial similarity. Then, these subgraphs are sequentially connected by their root nodes and merged into a complete graph. Based on this graph, we devise a graph convolutional network for textual reasoning (GTR) by supervising it with a cross-entropy loss. GTR can be easily plugged in representative STR models to improve their performance owing to better textual reasoning. Specifically, we construct our model, namely S-GTR, by paralleling GTR to the language model in a segmentation-based STR baseline, which can effectively exploit the visual-linguistic complementarity via mutual learning. S-GTR sets new state-of-the-art on six challenging STR benchmarks and generalizes well to multi-linguistic datasets. Code is available at https://github.com/adeline-cs/GTR.


Primitive Representation Learning for Scene Text Recognition

Cited by 71 publications

References 35 publications
