2022
DOI: 10.1007/978-3-031-19815-1_28

Pure Transformer with Integrated Experts for Scene Text Recognition

Cited by 7 publications (5 citation statements)
References 44 publications
“…, SRN [8], VisionLAN [55], PARSeq [57], ABINet++ [104], LevOCR [56], MATRN [10] and MGP-STR) perform better than language-free methods, showing the significance of linguistic information. PTIE [51], which utilizes a transformer-only model with multiple patch resolutions, also achieves good results. Notably, owing to the multi-granularity predictions, MGP-STR†CFS has already outperformed the recent state-of-the-art method MATRN.…”
Section: Results on Standard Benchmarks
confidence: 95%
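
A minimal sketch (PyTorch) of the multi-patch-resolution idea the excerpt attributes to PTIE: the same image is embedded at two different patch resolutions, each yielding its own token sequence for the transformer. The patch sizes, dimensions, and module names below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiResolutionPatchEmbed(nn.Module):
    def __init__(self, in_ch=3, embed_dim=384, patch_sizes=((4, 8), (8, 4))):
        super().__init__()
        # One patchifier per resolution; a conv with stride == kernel extracts
        # non-overlapping patches, equivalent to flatten-and-project in ViT.
        self.stems = nn.ModuleList(
            nn.Conv2d(in_ch, embed_dim, kernel_size=p, stride=p)
            for p in patch_sizes
        )

    def forward(self, x):
        # x: (B, C, H, W) -> one token sequence per patch resolution
        seqs = []
        for stem in self.stems:
            tokens = stem(x)                                # (B, D, H/ph, W/pw)
            seqs.append(tokens.flatten(2).transpose(1, 2))  # (B, N, D)
        return seqs

imgs = torch.randn(2, 3, 32, 128)   # a typical STR input size (assumption)
for seq in MultiResolutionPatchEmbed()(imgs):
    print(seq.shape)                # (2, 128, 384) for each resolution
```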
“…Inspired by the great success of the Transformer [49] in natural language processing (NLP) tasks, the application of Transformers to STR has also attracted increasing attention [50], [51], [52]. Vision Transformer (ViT) [11], which directly processes image patches without convolutions, opened the way to using Transformer blocks instead of CNNs to solve computer vision problems [53], [54], leading to prominent results.…”
Section: Language-free STR Methods
confidence: 99%
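
A minimal sketch (PyTorch) of the convolution-free patch embedding the excerpt describes for ViT [11]: the image is split into fixed-size patches, each patch is flattened, and a single linear layer projects it to a token. The sizes are illustrative, not tied to any cited model.

```python
import torch
import torch.nn as nn

patch_h, patch_w, embed_dim = 4, 8, 256
proj = nn.Linear(3 * patch_h * patch_w, embed_dim)

x = torch.randn(2, 3, 32, 128)                       # (B, C, H, W)
# Extract non-overlapping patches, then flatten each patch to a vector.
patches = x.unfold(2, patch_h, patch_h).unfold(3, patch_w, patch_w)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, -1, 3 * patch_h * patch_w)
tokens = proj(patches)                               # (B, N, D) transformer input
print(tokens.shape)                                  # torch.Size([2, 128, 256])
```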
“…However, the above approaches use a fixed patch resolution when segmenting images with ViT, which can hurt text images of particular word lengths and scales. To address this, [16] proposed PTIE (Pure Transformer with Integrated Experts), a pure Transformer model that accommodates different patch resolutions and decodes in both the original and reverse character orders. MGP-STR [17] presented a custom Adaptive Addressing and Aggregation (A3) module, named the Character A3 module, which selects meaningful token combinations from ViT and integrates them into an output token corresponding to a specific character.…”
Section: Transformer-based Methods
confidence: 99%
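
A minimal sketch (PyTorch) of an A3-style aggregation as described in the excerpt: learnable per-position queries attend over the ViT output tokens, and each query aggregates them into one output token for one character slot. This is an illustrative reconstruction under assumed dimensions, not MGP-STR's exact A3 module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharAggregation(nn.Module):
    def __init__(self, dim=384, max_chars=25):
        super().__init__()
        # One learnable query per character slot (assumed parameterization).
        self.queries = nn.Parameter(torch.randn(max_chars, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, vit_tokens):
        # vit_tokens: (B, N, D) from the ViT encoder
        attn = torch.einsum("kd,bnd->bkn", self.queries, vit_tokens) * self.scale
        attn = F.softmax(attn, dim=-1)        # weight of each ViT token per slot
        return torch.einsum("bkn,bnd->bkd", attn, vit_tokens)  # (B, K, D)

tokens = torch.randn(2, 128, 384)
char_tokens = CharAggregation()(tokens)        # one aggregated token per character
print(char_tokens.shape)                       # torch.Size([2, 25, 384])
```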
“…Biten et al. [27] propose a layer-aware transformer with a pre-training scheme based on text and spatial cues only, and show that it handles multimodality well on scanned documents in scene-text visual question answering. Based on ViT [4], Tan et al. [28] propose a mixture of experts of pure transformers that processes different resolutions for scene text recognition. Atienza [5] presents a new transformer for scene text recognition, called ViTSTR, which uses only the encoder architecture and emphasizes the balance between performance and computational efficiency.…”
Section: B. Vision Transformers for STR
confidence: 99%
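
A minimal sketch (PyTorch) of the encoder-only recognition head the excerpt attributes to ViTSTR [5]: no decoder is used; the first max_len encoder output tokens are each classified into a character directly. The vocabulary size and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, max_len, vocab = 384, 25, 96     # 96 ~ charset plus specials (assumption)
head = nn.Linear(dim, vocab)

enc_out = torch.randn(2, 128, dim)    # ViT encoder output: (B, N, D)
logits = head(enc_out[:, :max_len])   # classify the first max_len tokens
pred = logits.argmax(-1)              # greedy per-position decoding
print(pred.shape)                     # torch.Size([2, 25])
```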