2022
DOI: 10.48550/arxiv.2211.04785
Preprint

Masked Vision-Language Transformers for Scene Text Recognition

Abstract: Scene text recognition (STR) enables computers to recognize and read text in various real-world scenes. Recent STR models benefit from taking linguistic information into consideration in addition to visual cues. We propose novel Masked Vision-Language Transformers (MVLT) to capture both explicit and implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretrain…
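To make the architecture described in the abstract concrete, the following is a minimal sketch of a ViT-style encoder paired with a multi-modal Transformer decoder in PyTorch. Every size, name, and module choice here is an illustrative assumption, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class MVLTSketch(nn.Module):
    """Illustrative skeleton only: a ViT-style encoder over image patches and
    a Transformer decoder that attends to both the visual features and the
    previously decoded characters. Hyperparameters are placeholders."""

    def __init__(self, d_model=256, nhead=8, vocab_size=100, num_patches=196):
        super().__init__()
        # ViT-style encoder: linear patch embedding plus learned positions.
        self.patch_embed = nn.Linear(16 * 16 * 3, d_model)  # flattened 16x16 RGB patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, d_model))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=4)
        # Multi-modal decoder: self-attends over characters, cross-attends to vision.
        self.char_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=4)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, patches, prev_chars):
        # patches: (batch, num_patches, 16*16*3); prev_chars: (batch, seq_len)
        visual = self.encoder(self.patch_embed(patches) + self.pos_embed)
        tgt = self.char_embed(prev_chars)
        seq_len = prev_chars.size(1)
        # Causal mask: -inf above the diagonal blocks attention to future characters.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        decoded = self.decoder(tgt, visual, tgt_mask=causal)
        return self.out_proj(decoded)  # logits over the character vocabulary
```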

Cited by 1 publication (1 citation statement)
References 32 publications
“…Subsequently, the final output undergoes conditional probability estimation, a crucial step in the decoding mechanism. Building on the layer-normalized output, the model computes the conditional probability distribution over the vocabulary for the subsequent token [32]. This involves applying the softmax function, which transforms raw scores into a probability distribution.…”
Section: Masked Multi-Head Attention of Decoders (mentioning)
confidence: 99%
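The cited statement describes the standard final decoding step: project the layer-normalized decoder output to vocabulary logits, then apply softmax to obtain a conditional distribution over the next token. Below is a minimal NumPy sketch of that step; the sizes and the output projection are hypothetical, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean / unit variance (learned scale and shift omitted).
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_model, vocab_size = 16, 40                          # illustrative sizes, not the paper's
hidden = rng.normal(size=d_model)                     # decoder output at the current position
W_out = rng.normal(size=(d_model, vocab_size)) * 0.1  # hypothetical output projection

logits = layer_norm(hidden) @ W_out   # raw scores over the vocabulary
probs = softmax(logits)               # conditional distribution for the next token
next_token = int(probs.argmax())      # e.g., greedy choice of the subsequent token
assert abs(probs.sum() - 1.0) < 1e-9  # softmax yields a valid probability distribution
```

The softmax turns arbitrary real-valued scores into positive values that sum to one, which is what makes the output interpretable as a conditional probability distribution over the vocabulary.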