2021
DOI: 10.48550/arxiv.2111.15263
Preprint

Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features

Abstract: Linguistic knowledge has brought great benefits to scene text recognition by providing semantics to refine character sequences. However, since linguistic knowledge has been applied individually on the output sequence, previous methods have not fully utilized the semantics to understand visual clues for text recognition. This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performances. Sp…

Cited by 2 publications (2 citation statements)
References 28 publications
“…To benefit from the generated visual features following linguistic rules, increased research interests have been dedicated to using transformers for STR recently [4,24,25], where the encoder extracts visual features and the decoder predicts characters in images. Owing to their parallel self-attention and prediction mechanisms, transformers can overcome the difficulties of sequential inference with diverse scene text features to some extent [1].…”
Section: B. Vision Transformers for STR
confidence: 99%
“…More specifically, they obtain OCR annotations from an open-source OCR engine, Tesseract [45], for 5 million documents from the IIT-CDIP [25] dataset. With the introduction of pre-training strategies and advances in modern OCR engines [1,12,20,28,34], many contemporary approaches [7,2,53] have utilized even more data to advance the Document Intelligence field.…”
Section: Introduction
confidence: 99%