Image captioning models typically follow an encoder-decoder architecture that uses abstract image feature vectors as input to the encoder. One of the most successful approaches uses feature vectors extracted from the region proposals produced by an object detector. In this work we introduce the Object Relation Transformer, which builds on this approach by explicitly incorporating information about the spatial relationships between detected objects through geometric attention. Quantitative and qualitative results demonstrate the importance of such geometric attention for image captioning, leading to improvements on all common captioning metrics on the MS-COCO dataset.
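The geometric attention described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes bounding boxes given as (x_center, y_center, w, h), uses the common pairwise relative-geometry features (log-scaled displacements and size ratios), and modulates scaled dot-product attention with ReLU-gated geometric scores. All function names are hypothetical, and in the actual model the geometric logits would come from a learned embedding of these features.

```python
import numpy as np

def box_geometry_features(boxes):
    # boxes: (N, 4) array of (x_center, y_center, w, h) for detected regions.
    # Returns (N, N, 4) pairwise relative-geometry features in the common
    # relation-attention form: log-scaled displacements and size ratios.
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    dx = np.log(np.abs(x[:, None] - x[None, :]) / w[:, None] + 1e-6)
    dy = np.log(np.abs(y[:, None] - y[None, :]) / h[:, None] + 1e-6)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def geometric_attention(queries, keys, values, geom_logits):
    # Scaled dot-product attention whose weights are modulated by geometric
    # scores: w_mn is proportional to relu(geom_mn) * exp(appearance_mn).
    d = queries.shape[-1]
    appearance = queries @ keys.T / np.sqrt(d)        # (N, N) appearance logits
    geom = np.maximum(geom_logits, 0.0)               # ReLU-gated geometric weights
    w = geom * np.exp(appearance - appearance.max(axis=-1, keepdims=True))
    w = w / (w.sum(axis=-1, keepdims=True) + 1e-9)    # combined, normalized weights
    return w @ values
```

In practice `geom_logits` would be produced by projecting `box_geometry_features` through a learned high-dimensional embedding; here any (N, N) score matrix serves to show how geometry gates the appearance-based attention.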
In this paper, we propose two multiple-frame super-resolution (SR) algorithms based on dictionary learning (DL) and motion estimation. First, we adopt video bilevel DL, which has previously been used for single-frame SR, and extend it to multiple frames via motion estimation with sub-pixel accuracy. We propose both a batch and a temporally recursive multi-frame SR algorithm, each of which improves over single-frame SR. Finally, we propose a novel DL algorithm that trains on consecutive video frames, rather than still images or individual video frames, which further improves the performance of the video SR algorithms. Extensive experimental comparisons with state-of-the-art SR algorithms verify the effectiveness of our proposed multiple-frame video SR approach.
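As an illustration of the coupled-dictionary reconstruction step that bilevel DL approaches rely on, the sketch below sparsely codes a low-resolution patch over a low-resolution dictionary and reconstructs the high-resolution patch with the paired high-resolution dictionary. This is a minimal sketch under stated assumptions: the OMP coder, function names, and dictionary shapes are illustrative, not the paper's algorithm, and in the multi-frame setting the input would be a motion-compensated stack of patches drawn from consecutive frames.

```python
import numpy as np

def sparse_code_omp(y, D, n_atoms=3):
    # Orthogonal Matching Pursuit: greedily build a sparse code of y over
    # dictionary D (columns assumed unit-norm), refitting coefficients by
    # least squares after each atom selection.
    residual, support = y.copy(), []
    for _ in range(n_atoms):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        atoms = D[:, support]
        coef, *_ = np.linalg.lstsq(atoms, y, rcond=None)
        residual = y - atoms @ coef
    code = np.zeros(D.shape[1])
    code[support] = coef
    return code

def coupled_sr_patch(lr_patch, D_lo, D_hi, n_atoms=3):
    # Coupled-dictionary SR: code the low-resolution patch over D_lo, then
    # reconstruct the high-resolution patch with the same code over D_hi.
    code = sparse_code_omp(lr_patch, D_lo, n_atoms)
    return D_hi @ code
```

The bilevel aspect of the training, which this sketch omits, is that the dictionary pair (D_lo, D_hi) is learned jointly so that codes inferred from low-resolution patches reconstruct their high-resolution counterparts well.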