Segmenting characters in an image is a classic yet challenging task in computer vision. Correctly determining the boundaries of touching (adhesive) characters with varied scales and shapes is essential for character segmentation, especially for separating handwritten characters. Nevertheless, few works in the literature achieve satisfactory performance. In this article, leveraging the capability of deep neural networks, we propose a two-stage character segmentation network with two-stream attention and edge refinement (TSER) to tackle this problem. TSER first locates each character via object detection and then extracts its contour. In this process, a novel two-stream attention mechanism (TSAM) is proposed to make the network focus on the discrepancies at character boundaries. Furthermore, a novel generation method dynamically produces anchors on different feature levels to improve the model's sensitivity to the shapes and scales of characters. Finally, a cascaded edge-refinement network obtains the contour of each character. To demonstrate the effectiveness and generalization ability of our model, we compare TSER with traditional algorithms and other deep learning models on two commonly used datasets across different segmentation tasks. The comparative results indicate that TSER reaches state-of-the-art performance.
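The abstract does not specify TSAM's internals; the following minimal PyTorch sketch illustrates one plausible two-stream attention design, with a boundary-oriented spatial stream and a channel-gating stream fused by summation. The module name, layer choices, and fusion scheme are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TwoStreamAttention(nn.Module):
    """Hypothetical two-stream attention: one stream weights spatial
    locations (boundary-sensitive), the other gates channels; the two
    attended maps are fused by element-wise summation."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Spatial stream: 1x1 conv producing a per-pixel attention map.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        # Channel stream: squeeze-and-excitation style gating.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) feature map from the detection backbone.
        spatial_out = x * self.spatial(x)   # emphasize boundary pixels
        channel_out = x * self.channel(x)   # emphasize informative channels
        return spatial_out + channel_out    # fuse the two streams

feats = torch.randn(2, 64, 32, 32)
attn = TwoStreamAttention(64)
print(attn(feats).shape)  # torch.Size([2, 64, 32, 32])
```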
Dense video captioning (DVC) aims at generating a description for each scene in a video. Despite encouraging progress on this task, previous works usually concentrate only on exploiting visual features while neglecting the audio information in the video, resulting in inaccurate event localization. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode data from different modalities, and an event refactoring algorithm is proposed to handle the inaccurate event localization caused by overlapping events. In addition, a shared encoder is employed to reduce model redundancy. CR optimizes the logic of the generated captions with both heterogeneous prior knowledge and entity-association reasoning, achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments on the ActivityNet Captions dataset demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also conduct ablation experiments to analyze the contributions of the different modules.
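The abstract leaves the CM module's structure unspecified; the sketch below shows one common way to realize cross-modal attention with a shared encoder in PyTorch, where visual tokens attend over audio tokens and a single encoder layer is reused to refine the fused sequence. All names, dimensions, and the residual-fusion choice are illustrative assumptions rather than the paper's reported design.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Hypothetical cross-modal block: visual tokens attend over audio
    tokens, then a shared Transformer encoder layer (reusable across
    modalities to cut redundancy) refines the fused sequence."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One encoder layer shared across modalities to reduce redundancy.
        self.shared_encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (N, Tv, dim) frame features; audio: (N, Ta, dim) audio features.
        fused, _ = self.cross_attn(query=visual, key=audio, value=audio)
        return self.shared_encoder(visual + fused)  # residual fusion

v = torch.randn(2, 100, 256)   # e.g., 100 visual tokens
a = torch.randn(2, 50, 256)    # e.g., 50 audio tokens
print(CrossModalAttention()(v, a).shape)  # torch.Size([2, 100, 256])
```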