Fisheye images have attracted increasing attention from the research community due to their large field of view (LFOV). However, the geometric transformations inherent in fisheye cameras introduce unknown spatial distortion and large variations in object appearance. As a result, state-of-the-art methods developed for conventional two-dimensional (2D) images perform poorly on fisheye images. To address this problem, we propose a self-study and contour-based object detector for fisheye images, named FisheyeDet. The No-prior Fisheye Representation Method is proposed to guarantee that the network adaptively extracts distortion features without prior information such as prespecified lens parameters or special calibration patterns. Furthermore, in order to localize objects in fisheye images tightly and robustly, the Distortion Shape Matching strategy is proposed, which uses irregular quadrilateral bounding boxes derived from the contours of distorted objects as its core. By combining the No-prior Fisheye Representation Method and Distortion Shape Matching, the proposed detector forms an end-to-end network. Finally, due to the lack of public fisheye datasets, we make the first attempt to create a multi-class fisheye dataset, VOC-Fisheye, for object detection. The proposed detector shows favorable generalization ability and achieves 74.87% mAP (mean average precision) on VOC-Fisheye, outperforming existing state-of-the-art methods.
INDEX TERMS: Fisheye, object detection and recognition, large field of view (LFOV), deep learning.
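The abstract does not give implementation details for the quadrilateral bounding boxes; the following is a minimal PyTorch sketch of the general idea of regressing four corner points (8 values) per anchor instead of an axis-aligned box (4 values). The class name, channel count, and anchor layout are assumptions for illustration, not the authors' actual head design.

```python
import torch
import torch.nn as nn

class QuadBoxHead(nn.Module):
    """Illustrative regression head: predicts an irregular quadrilateral
    (four corner offsets, 8 values) per anchor, rather than the usual
    axis-aligned box (4 values). All hyperparameters are assumptions."""

    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        # 8 offsets per anchor: (dx1, dy1, ..., dx4, dy4) for the four corners.
        self.quad_reg = nn.Conv2d(in_channels, num_anchors * 8,
                                  kernel_size=3, padding=1)

    def forward(self, feature_map):
        # feature_map: (B, C, H, W) from a backbone or FPN level.
        offsets = self.quad_reg(feature_map)      # (B, A*8, H, W)
        b, _, h, w = offsets.shape
        return offsets.view(b, -1, 8, h, w)       # (B, A, 8, H, W)

if __name__ == "__main__":
    head = QuadBoxHead()
    feats = torch.randn(2, 256, 32, 32)
    print(head(feats).shape)  # torch.Size([2, 9, 8, 32, 32])
```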
In recent years, deep learning techniques have excelled in video action recognition. However, commonly used video action recognition models largely ignore the differing importance of individual video frames and of spatial regions within specific frames, which makes it difficult for them to adequately extract spatiotemporal features from video data. In this paper, an action recognition method based on residual convolutional neural networks (CNNs) improved with video frame and spatial attention modules is proposed to address this problem. Using the video frame attention module and the spatial attention module, the network learns what and where to emphasize or suppress at negligible computational cost. This two-level attention emphasizes feature information along the temporal and spatial dimensions, respectively, highlighting the more important frames in the video sequence and the more important spatial regions within specific frames. Specifically, the video frame attention module and the spatial attention module are applied successively, aggregating the intermediate feature maps of the CNNs along the temporal and spatial dimensions to obtain distinct feature descriptors and the corresponding attention maps, thus directing the network to focus on the most informative frames and the most contributing spatial regions. Experimental results show that the network performs well on the UCF-101 and HMDB-51 datasets.
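The abstract describes the two-level attention only at a high level; below is a minimal PyTorch sketch of one plausible reading of it: a frame (temporal) attention module followed by a CBAM-style spatial attention module applied to per-frame CNN feature maps. Module names, the (B, T, C, H, W) tensor layout, the reduction ratio, and the avg/max pooling choices are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Weights each frame of a clip with a scalar learned from its
    globally pooled features (a temporal analogue of channel attention)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 1),
        )

    def forward(self, x):
        # x: (B, T, C, H, W) stack of per-frame CNN feature maps.
        b, t, c, h, w = x.shape
        pooled = x.mean(dim=(3, 4))                 # (B, T, C) global average pool
        weights = torch.sigmoid(self.mlp(pooled))   # (B, T, 1), one weight per frame
        return x * weights.view(b, t, 1, 1, 1)

class SpatialAttention(nn.Module):
    """Highlights informative regions in each frame using a 7x7 conv over
    channel-pooled (avg + max) descriptors, in the spirit of CBAM."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # x: (B, T, C, H, W); fold time into the batch to reuse 2D convs.
        b, t, c, h, w = x.shape
        flat = x.view(b * t, c, h, w)
        avg = flat.mean(dim=1, keepdim=True)        # (B*T, 1, H, W)
        mx, _ = flat.max(dim=1, keepdim=True)       # (B*T, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return (flat * attn).view(b, t, c, h, w)

if __name__ == "__main__":
    feats = torch.randn(2, 16, 64, 14, 14)  # (batch, frames, channels, H, W)
    out = SpatialAttention()(FrameAttention(64)(feats))
    print(out.shape)  # torch.Size([2, 16, 64, 14, 14])
```

Applying the two modules in sequence mirrors the successive frame-then-spatial weighting the abstract describes, while keeping the added computation to a small MLP and a single 2D convolution.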