2022
DOI: 10.1609/aaai.v36i1.19971
Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition

Abstract: Existing Scene Text Recognition (STR) methods typically use a language model to optimize the joint probability of the 1D character sequence predicted by a visual recognition (VR) model. This ignores the 2D spatial context of visual semantics within and between character instances, so these methods generalize poorly to arbitrarily shaped scene text. To address this issue, we make the first attempt to perform textual reasoning based on visual semantics in this paper. Technically, given the character segmentation maps …
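The abstract contrasts 1D sequence modeling with reasoning over 2D spatial relations between character instances. The paper's actual architecture is not reproduced here; as a minimal illustrative sketch only (all function names, the centroid representation, and the Gaussian-kernel similarity are assumptions, not the authors' method), the snippet below shows how pairwise spatial similarity between character centroids could define a graph over which per-character features are aggregated:

```python
import numpy as np

def spatial_adjacency(centroids, sigma=1.0):
    # Hypothetical adjacency: Gaussian kernel on pairwise centroid distances.
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    A = np.exp(-(d ** 2) / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)  # no self-edges; self-loops are added below
    return A

def graph_reason(features, centroids, sigma=1.0):
    # One round of row-normalized message passing over the spatial graph.
    A_hat = spatial_adjacency(centroids, sigma) + np.eye(len(centroids))
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))
    return D_inv @ A_hat @ features

# Toy example: three character instances along a curved baseline.
centroids = np.array([[0.0, 0.0], [1.0, 0.3], [2.0, 0.9]])
features = np.eye(3)  # placeholder one-hot features per character node
out = graph_reason(features, centroids)
```

Because the aggregation is row-normalized, each output row is a convex combination of neighboring character features, so spatially close characters influence each other more strongly than distant ones.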

Cited by 39 publications (11 citation statements) · References 29 publications
“…[95] proposed a multi‐stage and multi‐scale decoder that jointly carries out visual and semantic reasoning, with recurrent stages refining the results. Lastly, [96] proposes a graph‐based solution that groups pixels in a character instance based on their location similarity, achieving state‐of‐the‐art results.…”
Section: Text Spotting
confidence: 99%
“…The performance of the recognition methods can be seen in Table 4. The most prominent method in ICDAR 2013 dataset is S‐GTR [96] with its graph‐based convolutional network and visual semantic knowledge, obtaining an accuracy of 97.80%. On ICDAR 2015, this method was also the most successful approach, obtaining a score of 87.30%.…”
Section: Text Spotting
confidence: 99%
“…VST [35] extracts semantics from a visual feature map and performs a visual-semantic interaction using an alignment module. Most recently, some language-aware models explore more possibilities to boost the performance by considering spatial context [13], adding real-world images to train the model in a semi-supervised way [1], or developing a re-ranking method to get a better candidate output [30], while our work takes a step by extracting both the explicit and the implicit semantics.…”
Section: Related Work
confidence: 99%
“…Unlike traditional text recognition methods, deep learning methods [5][6][7][8][9][10][11] perform a top-down approach to recognize words or text lines directly, showing more promising results. Recent approaches [4,12] treat the STR problem as a sequence prediction problem, resulting in increased precision while eliminating the need for character-level annotation.…”
Section: Introduction
confidence: 99%