Scene text detection (STD) is an indispensable step in a scene text reading system. It remains a more challenging task than general object detection, since text objects have arbitrary orientations and varying sizes. Segmentation methods that use U-Net or hourglass-like networks are the mainstream approaches to multi-oriented text detection. However, experience has shown that text-like objects in complex backgrounds produce high response values on the output feature map of U-Net, which leads to a high false-positive rate and degrades STD performance. To tackle this issue, an adaptive soft attention mechanism called the contextual attention module (CAM) is devised and integrated into U-Net to highlight salient areas while retaining more detailed information. In addition, the gradient vanishing and exploding problems make U-Net more difficult to train because of the nonlinear deconvolution layers used in the up-sampling process. To facilitate training, a gradient-inductive module (GIM) is designed to provide a linear bypass that makes gradient back-propagation more stable. Accordingly, an end-to-end trainable Gradient-Inductive Segmentation network with Contextual Attention (GISCA) is proposed. Experimental results on three public benchmarks demonstrate that the proposed GISCA achieves state-of-the-art results in terms of f-measure: 92.1%, 87.3%, and 81.4% on ICDAR 2013, ICDAR 2015, and MSRA TD500, respectively.

INDEX TERMS Scene text detection, multi-oriented text, segmentation network, contextual attention, gradient vanishing/exploding problems.

I. INTRODUCTION
Scene text is one of the most common objects in natural scenes. It appears frequently in practical settings and carries important information for many applications [1]-[4], such as blind navigation, scene understanding, and autonomous driving.
The associate editor coordinating the review of this manuscript and approving it for publication was Jafar A. Alzubi.

Scene text reading is thus of great importance in computer vision, and it has made tremendous progress thanks to the development of deep convolutional neural networks (DCNNs) [5]. Generally, scene text reading can be divided into two main sub-tasks: scene text detection (STD) and scene text recognition. We focus on STD in this paper, since it is the prerequisite and a crucial step for scene text recognition. Despite its similarity to traditional OCR, STD remains challenging because text instances exhibit vast diversity in fonts, scales, and orientations, as well as uncontrollable illumination effects [6]. Thanks to the recent development of general object detection [7], [8] and instance segmentation methods [9], STD has made considerable progress. Deep learning based methods for STD learn hierarchical features directly from input images and demonstrate more accurate and efficient performance on various public STD benchmarks. However, due to the characteristics of scene text, the...