Mask R-CNN With Pyramid Attention Network for Scene Text Detection

Huang, Zhida; Zhong, Zhuoyao; Sun, Lei; Huo, Qiang

doi:10.1109/wacv.2019.00086

Cited by 92 publications

(54 citation statements)

References 40 publications


“…Alternatively, attention mechanisms have been widely studied in deep CNNs for many computer vision tasks in order to efficiently integrate local and global features, including human pose estimation [12], emotion recognition [13], text detection [14], object detection [15] and classification [16]. Unlike standard multi-scale features fusion approaches, which compress an entire image into a static representation, attention allows the network to focus on the most relevant features without additional supervision, avoiding the use of multiple similar feature maps and highlighting salient features that are useful for a given task.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Scale Self-Guided Attention for Medical Image Segmentation

Sinha

Dolz

2021

IEEE J. Biomed. Health Inform.


Even though convolutional neural networks (CNNs) are driving progress in medical image segmentation, standard models still have some drawbacks. First, the use of multi-scale approaches, i.e., encoder-decoder architectures, leads to a redundant use of information, where similar low-level features are extracted multiple times at multiple scales. Second, long-range feature dependencies are not efficiently modeled, resulting in nonoptimal discriminative feature representations associated with each semantic class. In this paper we attempt to overcome these limitations with the proposed architecture, by capturing richer contextual dependencies based on the use of guided self-attention mechanisms. This approach is able to integrate local features with their corresponding global dependencies, as well as highlight interdependent channel maps in an adaptive manner. Further, the additional loss between different modules guides the attention mechanisms to neglect irrelevant information and focus on more discriminant regions of the image by emphasizing relevant feature associations. We evaluate the proposed model in the context of abdominal organ segmentation on magnetic resonance imaging (MRI). A series of ablation experiments support the importance of these attention modules in the proposed architecture. In addition, compared to other state-of-the-art segmentation networks our model yields better segmentation performance, increasing the accuracy of the predictions while reducing the standard deviation. This demonstrates the efficiency of our approach to generate precise and reliable automatic segmentations of medical images. Our code and the trained model are made publicly available at: https://github.com/sinAshish/Multi-Scale-Attention
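The abstract above speaks of highlighting interdependent channel maps adaptively but gives no formula. As a rough illustration only, the sketch below shows one common way channel-wise self-attention can be computed: each channel is re-expressed as a softmax-weighted mix of all channels, based on pairwise dot-product affinity, and added back to the input. The function names, the residual connection, and the absence of learned parameters are assumptions made for this sketch; it is not taken from the authors' published code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(features):
    """Simplified channel-wise self-attention (hypothetical helper).

    features: list of C channels, each a flat list of N spatial values.
    Returns a list of the same shape.
    """
    num_channels = len(features)
    num_positions = len(features[0])
    # Affinity between every pair of channels: dot product of their maps.
    affinity = [
        [sum(a * b for a, b in zip(ci, cj)) for cj in features]
        for ci in features
    ]
    # Normalise each row so the mixing weights for a channel sum to 1.
    weights = [softmax(row) for row in affinity]
    output = []
    for i in range(num_channels):
        # Weighted mix of all channels at every spatial position.
        mixed = [
            sum(weights[i][j] * features[j][k] for j in range(num_channels))
            for k in range(num_positions)
        ]
        # Residual connection (an assumption of this sketch).
        output.append([x + m for x, m in zip(features[i], mixed)])
    return output


# Toy input: two channels with two spatial positions each.
toy = [[1.0, 0.0], [0.0, 1.0]]
attended = channel_attention(toy)
```

In a trained network the affinity would be computed on learned feature maps and the mixed term is typically scaled by a learned factor; both are omitted here to keep the sketch short.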


“…TextContourNet [43] extract instance-level text contour to increase the accuracy of curve text detection. In [44], to enhance the feature representation ability, a pyramid attention network is used [for] text detection tasks. Mask-Most Net [45] use instance-level mask approximation method through a combination of high-level semantic and low-level features.…”

Section: A. Scene Text Detection (mentioning)

confidence: 99%

Cluttered TextSpotter: An End-to-End Trainable Light-Weight Scene Text Spotter for Cluttered Environment

2020


Scene text spotting aims at simultaneously localizing and recognizing text instances, symbols, and logos in natural scene images. Scene text detection and recognition approaches have received immense attention in the computer vision research community. Partial occlusion and truncation artifacts caused by the cluttered background of scene images create an obstacle to perceiving the text instances, which makes the process of spotting very complex. In this paper, we propose a lightweight scene text spotter that can address the issue of cluttered environments in scene images. It is an end-to-end trainable deep neural network that uses local part information, global structural features, and context cue information of oriented region proposals for spotting text instances. It helps to localize text in scene images with background clutter, where partially occluded text parts, truncation artifacts, and perspective distortions are present. We mitigate the problem of misclassification caused by inter-class interference by exploring inter-class separability and intra-class compactness. We also incorporate multi-language character segmentation and word-level recognition in a lightweight recognition module. We have used six publicly available benchmark datasets in different smart devices to illustrate the efficacy of the network.


“…SSTD [31] takes the pixel-wise binary mask of text images as an additional input, which is used to refine the original feature map. Huang et al incorporate the proposed Pyramid Attention Network (PAN) into the Mask R-CNN framework to enhance the text detection performance [32]. R2CNN++ [33] adopts a multi-dimensional attention mechanism which contains both pixel-wise and channel-wise attention to reduce the adverse impact of noise.…”

Section: B. Attention Mechanism (mentioning)

confidence: 99%

GISCA: Gradient-Inductive Segmentation Network With Contextual Attention for Scene Text Detection

Cao

Zou

Yang et al.

2019

IEEE Access


Scene text detection (STD) is an irreplaceable step in a scene text reading system. It remains a more challenging task than general object detection since text objects are of arbitrary orientations and varying sizes. Generally, segmentation methods that use U-Net or hourglass-like networks are the mainstream approaches in multi-oriented text detection tasks. However, experience has shown that text-like objects in complex backgrounds have high response values on the output feature map of U-Net, which leads to a severe false positive detection rate and degrades STD performance. To tackle this issue, an adaptive soft attention mechanism called the contextual attention module (CAM) is devised and integrated into U-Net to highlight salient areas while retaining more detail information. Besides, the gradient vanishing and exploding problems make U-Net more difficult to train because of the nonlinear deconvolution layer used in the up-sampling process. To facilitate the training process, a gradient-inductive module (GIM) is carefully designed to provide a linear bypass that makes the gradient back-propagation process more stable. Accordingly, an end-to-end trainable Gradient-Inductive Segmentation network with Contextual Attention (GISCA) is proposed. The experimental results on three public benchmarks have demonstrated that the proposed GISCA achieves state-of-the-art results in terms of f-measure: 92.1%, 87.3%, and 81.4% for ICDAR 2013, ICDAR 2015, and MSRA TD500, respectively.

INDEX TERMS Scene text detection, multi-oriented text, segmentation network, contextual attention, gradient vanishing/exploding problems.

The associate editor coordinating the review of this manuscript and approving it for publication was Jafar A. Alzubi.

I. INTRODUCTION

Scene text is one of the most common objects in nature; it frequently appears in many practical scenes and contains important information for many applications [1]-[4], such as blind navigation, scene understanding, autonomous driving, etc. Scene text reading is thus of great importance in computer vision, which has made tremendous progress benefiting from the development of Deep Convolution Neural Networks (DCNNs) [5]. Generally, scene text reading can be divided into two main sub-tasks: scene text detection (STD) and scene text recognition. We focus on the task of STD in this paper since it is the prerequisite and crucial step of scene text recognition. Despite the similarity to traditional OCR, STD is still challenging because text instances exhibit vast diversity in fonts, scales, and arbitrary orientations, as well as uncontrollable illumination effects [6], etc. Thanks to the recent development of general object detection [7], [8] and instance segmentation methods [9], STD has witnessed tremendous progress. Deep learning based methods for STD directly learn hierarchical features from input images, which demonstrate more accurate and efficient performance in various STD public benchmarks. However, due to the characteristics of scene text, the...
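The f-measure figures quoted in the GISCA abstract above combine precision and recall into a single score. For readers who want the arithmetic, here is a minimal sketch of the standard harmonic-mean definition. The function name and the example counts are hypothetical and do not come from any of the benchmarks named above; the matching rule that decides what counts as a true positive (for example an IoU threshold) differs between benchmarks and is not modelled here.

```python
def f_measure(true_positives: int, num_detections: int, num_ground_truth: int) -> float:
    """Harmonic mean of precision and recall for a detector.

    true_positives:   detections matched to a ground-truth text instance
    num_detections:   all boxes the detector produced
    num_ground_truth: all annotated text instances
    """
    if num_detections == 0 or num_ground_truth == 0:
        return 0.0
    precision = true_positives / num_detections
    recall = true_positives / num_ground_truth
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen for illustration only.
score = f_measure(true_positives=80, num_detections=100, num_ground_truth=90)
```

With these counts precision is 80/100 and recall is 80/90, so the score equals 2·80/(100+90).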

