GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition

Hu, Wenyang; Cai, Xiaocong; Hou, Jun; Yi, Shuai; Lin, Zhiping

doi:10.1609/aaai.v34i07.6735

Cited by 111 publications

(54 citation statements)

References 18 publications

“…For example, Shi et al. [30] proposed the CTC-based method, where the visual feature extracted by CNN was reshaped as a sequence and then modeled by RNN and CTC loss. Following this pipeline, several methods were developed with improved accuracy [8,9,33]. Rather than decoding by RNN, segmentation-based methods [17,19,40] directly performed pixel-level character segmentation and prediction.…”

Section: Semantic-free Methods (mentioning)

confidence: 99%

CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

Zheng, Chen, Fang et al. 2021

Preprint

The attention-based encoder-decoder framework is becoming popular in scene text recognition, largely due to its superiority in integrating recognition clues from both visual and semantic domains. However, recent studies show the two clues might be misaligned in the difficult text (e.g., with rare text shapes) and introduce constraints such as character position to alleviate the problem. Despite certain success, a content-free positional embedding hardly associates with meaningful local image regions stably. In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visual and semantic related position encoding. MDCDP uses positional embedding to query both visual and semantic features following the attention mechanism. It naturally encodes the positional clue, which describes both visual and semantic distances among characters. We develop a novel architecture named CDistNet that stacks MDCDP several times to guide precise distance modeling. Thus, the visual-semantic alignment is well built even various difficulties presented. We apply CDistNet to two augmented datasets and six public benchmarks. The experiments demonstrate that CDistNet achieves state-of-the-art recognition accuracy. While the visualization also shows that CDistNet achieves proper attention localization in both visual and semantic domains. The code will be released in https://github.com/simplify23/CDistNet.
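The citation statement above describes the CTC-based pipeline, in which CNN features are read as a sequence and trained with CTC loss. At inference time such a model usually ends with a collapse step, which can be sketched as follows. This is a minimal illustration and not code from GTC or CDistNet: the function name `ctc_greedy_decode`, the choice of index 0 as the blank, and the toy alphabet are all assumptions made for the example.

```python
from typing import Sequence

BLANK = 0  # assumption: class index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Greedy (best-path) CTC decoding.

    frame_scores[t][k] is the score of class k at time step t.
    Class k > 0 maps to alphabet[k - 1]; class 0 is the blank.
    """
    # Pick the highest-scoring class independently at every time step.
    best_path = [max(range(len(f)), key=lambda k: f[k]) for f in frame_scores]

    decoded = []
    previous = BLANK
    for index in best_path:
        # Drop blanks and merge consecutive repeats of the same class.
        if index != BLANK and index != previous:
            decoded.append(alphabet[index - 1])
        previous = index
    return "".join(decoded)


# Toy example: five frames over the classes (blank, 'a', 'b').
scores = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.8, 0.1],  # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.1, 0.7, 0.2],  # a
    [0.1, 0.2, 0.7],  # b
]
print(ctc_greedy_decode(scores, "ab"))
```

Because every frame is decoded independently, this step runs in parallel over the sequence, which is the efficiency argument usually made for CTC-based recognizers over autoregressive attention decoders.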

“…As can be seen from the results shown in Tab.4, DPAN achieves the best performance among all types of approaches. Note that GTC [8] uses additional text images for training. Even so, DPAN performs better than GTC on most datasets.…”

Section: Comparison With State-of-the-art Methods (mentioning)

confidence: 99%

Look Back Again

Xie, Jin et al. 2021

Proceedings of the 2021 International Conference on Multimedia Retrieval

Nowadays, it is a trend that using a parallel-decoupled encoder-decoder (PDED) framework in scene text recognition for its flexibility and efficiency. However, due to the inconsistent information content between queries and keys in the parallel positional attention module (PPAM) used in this kind of framework (queries: position information, keys: context and position information), visual misalignment tends to appear when confronting hard samples (e.g., blurred texts, irregular texts, or low-quality images). To tackle this issue, in this paper, we propose a dual parallel attention network (DPAN), in which a newly designed parallel context attention module (PCAM) is cascaded with the original PPAM, using linguistic contextual information to compensate for the information inconsistency between queries and keys. Specifically, in PCAM, we take the visual features from PPAM as inputs and present a bidirectional language model to enhance them with linguistic contexts to produce queries. In this way, we make the information content of the queries and keys consistent in PCAM, which helps to generate more precise visual glimpses to improve the entire PDED framework's accuracy and robustness. Experimental results verify the effectiveness of the proposed PCAM, showing the necessity of keeping the information consistency between queries and keys in the attention mechanism. On six benchmarks, including regular text and irregular text, the performance of DPAN surpasses the existing leading methods by large margins, achieving new state-of-the-art performance. The code is available on https://github.com/Jackandrome/DPAN.

CCS Concepts: Computing methodologies → Computer vision tasks.

“…1.8% and 1.7% improvements are achieved on two irregular datasets IC15 and SVTP without any pre-processing like rectification. Compared with the CTC-based method GTC [23], our PIMNet outperforms it on all six benchmarks under the same setting of training data. Besides autoregressive guidance, our PIMNet also adopts an iterative easy first decoding strategy to extract context information and mimicking learning to improve the learning of the hidden layers, which is a further step.…”

Section: Comparisons With State-of-the-arts (mentioning)

confidence: 98%

“…1, PIMNet with mimicking learning achieves better accuracy, especially 1.1% on SVT and 1.1% on IC15. Note that the PIMNet without mimicking still retains the autoregressive decoder, which is similar to GTC [23]. Each pixel shows the cosine similarities cos_ij of the i-th and j-th outputs.…”

Section: Ablation Studies (mentioning)

confidence: 99%

PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition

Qiao, Zhou, Wei et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia

Nowadays, scene text recognition has attracted more and more attention due to its various applications. Most state-of-the-art methods adopt an encoder-decoder framework with attention mechanism, which generates text autoregressively from left to right. Despite the convincing performance, the speed is limited because of the one-by-one decoding strategy. As opposed to autoregressive models, non-autoregressive models predict the results in parallel with a much shorter inference time, but the accuracy falls behind the autoregressive counterpart considerably. In this paper, we propose a Parallel, Iterative and Mimicking Network (PIMNet) to balance accuracy and efficiency. Specifically, PIMNet adopts a parallel attention mechanism to predict the text faster and an iterative generation mechanism to make the predictions more accurate. In each iteration, the context information is fully explored. To improve learning of the hidden layer, we exploit the mimicking learning in the training phase, where an additional autoregressive decoder is adopted and the parallel decoder mimics the autoregressive decoder with fitting outputs of the hidden layer. With the shared backbone between the two decoders, the proposed PIMNet can be trained end-to-end without pre-training. During inference, the branch of the autoregressive decoder is removed for a faster speed. Extensive experiments on public benchmarks demonstrate the effectiveness and efficiency of PIMNet. Our code is available in the supplementary material.

CCS Concepts: Applied computing → Optical character recognition.
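One of the PIMNet citation statements above refers to a map in which each pixel shows the cosine similarity cos_ij between the i-th and j-th outputs. A minimal sketch of how such a matrix can be computed is given below. It is not taken from the PIMNet code: the function name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration, and the outputs are taken to be plain lists of floats.

```python
import math
from typing import List, Sequence


def cosine_similarity_matrix(outputs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return S where S[i][j] is the cosine similarity of outputs[i] and outputs[j]."""
    # Euclidean norm of every output vector, computed once.
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in outputs]

    n = len(outputs)
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(outputs[i], outputs[j]))
            denom = norms[i] * norms[j]
            # Convention chosen here: a zero vector has similarity 0 with everything.
            sims[i][j] = dot / denom if denom > 0 else 0.0
    return sims


# Toy example with three 2-dimensional outputs.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in cosine_similarity_matrix(hidden):
    print([round(value, 3) for value in row])
```

Values near 1 off the diagonal indicate that two output positions carry nearly the same representation.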
