2021
DOI: 10.48550/arxiv.2111.11011
Preprint

CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

Abstract: The attention-based encoder-decoder framework is becoming popular in scene text recognition, largely due to its superiority in integrating recognition clues from both visual and semantic domains. However, recent studies show the two clues might be misaligned for difficult text (e.g., with rare text shapes) and introduce constraints such as character position to alleviate the problem. Despite certain success, a content-free positional embedding hardly associates stably with meaningful local image regions. In…
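
The abstract's point about content-free positional embeddings can be made concrete. The sketch below is a minimal, illustrative PyTorch fragment, not code from the paper; all names and sizes are assumptions. It builds a standard sinusoidal embedding and uses it as the query in cross-attention over visual features: the query depends only on the character index, so nothing ties it to a particular image region, which is the instability the abstract describes.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(max_len: int, dim: int) -> torch.Tensor:
    """Content-free positional embedding: depends only on the index,
    never on the image content."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    emb = torch.zeros(max_len, dim)
    emb[:, 0::2] = torch.sin(pos * div)
    emb[:, 1::2] = torch.cos(pos * div)
    return emb

# Illustrative decoder step: position-only queries attend over visual features.
dim, n_chars, n_patches = 256, 25, 64
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
queries = sinusoidal_embedding(n_chars, dim).unsqueeze(0)  # (1, 25, 256), no image content
visual = torch.randn(1, n_patches, dim)                    # encoder output (dummy here)
aligned, weights = attn(queries, visual, visual)           # hoped-for character alignment
```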

Cited by 6 publications (6 citation statements)
References: 39 publications
“…Additionally, classic methods for feature extraction using CNN, such as CRNN [38] and TRBA [1], were also included. We also incorporated the CDistNet method, which combines CNN and Transformer for joint feature extraction [52]. As the baseline for this article, we introduced the PARSeq method [53], which involves the joint learning of internal language models.…”
Section: Comparative Analysis With Existing Methods
confidence: 99%
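
As background for the CNN-plus-Transformer combination this citation mentions, here is a minimal hybrid feature extractor sketch. It is illustrative PyTorch only; the class name, layer counts, and widths are assumptions, not CDistNet's actual configuration. A small CNN extracts local features, which are flattened into tokens for a Transformer encoder that adds global context.

```python
import torch
import torch.nn as nn

class HybridFeatureExtractor(nn.Module):
    """Generic CNN + Transformer extractor in the spirit the citation describes."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(                        # local feature extraction
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=3)  # global context

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        f = self.cnn(img)                                # (B, C, H/4, W/4)
        f = f.flatten(2).transpose(1, 2)                 # (B, H*W/16, C) token sequence
        return self.transformer(f)

feats = HybridFeatureExtractor()(torch.randn(1, 3, 32, 128))  # (1, 256, 256)
```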
“…Existing internal language joint learning methods typically employ the Transformer architecture and integrate corresponding language branches. SRN [51] employs a semantic reasoning network to assist in text recognition, while CDistNet [52] introduces positional query vectors to align visual and semantic features. PARSeq [53] achieves decoding in arbitrary orders by using Permutation Language Modeling, creating connections between arbitrary characters.…”
Section: Related Work
confidence: 99%
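
The "positional query vectors" this citation attributes to CDistNet can be sketched as learned position embeddings that query the visual and semantic branches separately before fusing the results. The fragment below is a hedged illustration; all names, sizes, and the fuse-by-sum choice are assumptions, not the paper's exact MDCDP design.

```python
import torch
import torch.nn as nn

class PositionalQueryFusion(nn.Module):
    """Learned position vectors query visual and semantic features, then fuse."""
    def __init__(self, dim: int = 256, max_len: int = 25):
        super().__init__()
        self.pos_queries = nn.Parameter(torch.randn(1, max_len, dim))
        self.vis_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.sem_attn = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        q = self.pos_queries.expand(visual.size(0), -1, -1)
        v_out, _ = self.vis_attn(q, visual, visual)      # position -> image regions
        s_out, _ = self.sem_attn(q, semantic, semantic)  # position -> character context
        return v_out + s_out                             # fused, position-aligned (assumed sum)

fused = PositionalQueryFusion()(torch.randn(1, 64, 256), torch.randn(1, 25, 256))
```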
“…The model proposed by He et al. [53] constructs a subgraph for each instance and trains it using a graph convolutional network and a cross-entropy loss function, which achieves good results in text recognition. Zheng et al. proposed a new multi-domain character distance perception (MDCDP) module [54] to establish visually and semantically relevant position encoding, so as to improve the model's perception of character positions. Cui et al. proposed a representation and correlation enhanced encoder–decoder framework (RCEED) [55] to address these shortcomings and break through the performance bottleneck.…”
Section: Related Work
confidence: 99%
“…The NRTR algorithm [27] advocates full application of the Transformer architecture for image encoding and decoding, incorporating a simple convolution layer for feature extraction. CDistNet [39] developed a specialized module for location modeling to aid character decoding using accurate image features. Nonetheless, these techniques employ a step-by-step decoding approach, recognizing characters individually, leading to reduced speed.…”
Section: Transformer-Based Approach
confidence: 99%
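
The speed criticism in this citation follows from the decoding loop itself: an autoregressive recognizer runs one decoder pass per character, so latency grows with text length. A minimal greedy-decoding sketch makes the sequential bottleneck visible; the `decoder` callable, token ids, and maximum length are placeholders, not any specific model's API.

```python
import torch

@torch.no_grad()
def greedy_decode(decoder, visual_feats, bos_id=0, eos_id=1, max_len=25):
    """Step-by-step decoding: one full decoder pass per recognized character."""
    tokens = [bos_id]
    for _ in range(max_len):                                    # T sequential steps
        logits = decoder(torch.tensor([tokens]), visual_feats)  # (1, t, vocab)
        next_id = logits[0, -1].argmax().item()                 # greedy pick
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]
```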