Multiorientation scene text detection via coarse-to-fine supervision-based convolutional networks

Wang, Xihan; Xia, Zhaoqiang; Peng, Jinye; Feng, Xiaoyi

doi:10.1117/1.jei.27.3.033032

Cited by 7 publications

(3 citation statements)

References 43 publications

“…Since the era of deep learning, scene text detection has been developed relying on two types of methods, such as object detection 12 – 15 and semantic segmentation, 16 so that scene text detection can be divided into regression-based methods 4 , 5 , 17 – 22 and segmentation-based methods. 6 , 23 – 31 At the same time, due to the special nature of the text, methods at different detection scales are proposed. Classified by the object scale of detection, scene text detection can be divided into component-level detection 4 – 6 , 17 – 19 , 23 – 27 and instance-level detection 20 …”

Section: Related Work (mentioning)

confidence: 99%

Human-like cognition: visual features grouping for hard-to-group text dataset

Li, Liu, Tao et al. 2024, J. Electron. Imag.

Most existing arbitrary shape text detection methods employ connected components and text center lines for grouping text instances, which assume that texts in adjacent positions belong to the same instance. However, many hard-to-group scene texts are too complex to be effectively processed in this way. To address this challenge, we propose a novel scene text-spotting method that utilizes feature-based clustering inspired by human cognitive principles of text perception. Our approach involves first utilizing a character spotter to obtain the location and the transcription information of the characters. Then, a lightweight recognition network extracts the visual features of the characters by their locations. These visual features are then grouped into instances through a K-means-fuzzy-net, which explicitly models visual feature similarity to effectively group the nested text, the large-margin text, the continuous text, and the text with overlapping characters. Finally, the recognition results of text instances are processed by a word correction module to improve the overall accuracy and reduce the vulnerability of individual character detection. Additionally, we have contributed a hard-to-group text dataset. Experiments demonstrate the state-of-the-art performance of our method in addressing these scenarios. The hard-to-group text dataset is available at: https://github.com/baggio321/Hard-to-Group-Text-Dataset.

“…It can play a crucial role in both general scene understanding and in specific applications, such as autonomous driving 1 and robotic navigation. 2 Significant progress has been made in STR-based research since the development of deep learning, 3 – 6 object detection, 7 , 8 and text detection. 9 – 12 However, a range of circumstances, such as uneven illumination, poor image quality, perspective distortion, and orientation, continue to make recognizing texts from real images challenging.…”

Section: Introduction (mentioning)

confidence: 99%

“…2 Significant progress has been made in STR-based research since the development of deep learning, 3 – 6 object detection, 7 , 8 and text detection. 9 – 12 However, a range of circumstances, such as uneven illumination, poor image quality, perspective distortion, and orientation, continue to make recognizing texts from real images challenging.…”

Section: Introduction (mentioning)

confidence: 99%

Multilingual semantic fusion network for text recognition in the wild

et al. 2023

Most current approaches in the literature of scene text recognition train the language model via a text dataset far sparser than in natural language processing, resulting in inadequate training. Therefore, we propose a simple transformer encoder-decoder model called the multilingual semantic fusion network (MSFN) that can leverage prior linguistic knowledge to learn robust language features. First, we label the text dataset with forward sequences, backward sequences, and subwords, which are extracted by tokenization with linguistic information. Then we introduce a multilingual model to the decoder corresponding to three different channels of the labeled dataset. The final output is fused from the different channels to get more accurate results. In experiments, MSFN achieves cutting-edge performance across six benchmark datasets, and extensive ablative studies have proven the effectiveness of the proposed method. Code is available at https://github.com/lclee0577/MLViT.
