2020
DOI: 10.1109/access.2020.3042484

Automatic Image and Video Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap

Cited by 72 publications
(24 citation statements)
References 57 publications
“…This is currently done using deep models that combine image embeddings with the ability to generate descriptions using recurrent long short-term memory (LSTM) layers, such as [41, 42, 43]. Large surveys of this field can be found in [44, 45]. Thus, one could compare such automatically generated image descriptions with the sentences produced by the text summarizer (see Section 2.1.1) to determine the similarity between the text and the various candidate images, and select the image that minimises the given distance metric [46].…”
Section: Materials and Methods
confidence: 99%
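The selection step this excerpt describes can be sketched as follows. This is a minimal illustration only: it substitutes a bag-of-words cosine distance for the learned embeddings the cited works use, and the function names, captions, and image ids are invented for the example.

```python
from collections import Counter
from math import sqrt

def cosine_distance(a: str, b: str) -> float:
    """Bag-of-words cosine distance (1 - cosine similarity) between two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def select_image(summary_sentence: str, captions: dict) -> str:
    """Pick the image whose automatically generated caption minimises the distance
    to a sentence produced by the text summarizer."""
    return min(captions, key=lambda img: cosine_distance(summary_sentence, captions[img]))

# Hypothetical captions produced by an image-captioning model:
captions = {
    "img1.jpg": "a dog running on the beach",
    "img2.jpg": "a plate of pasta on a table",
}
best = select_image("the dog plays at the beach", captions)  # -> "img1.jpg"
```

In practice the cited pipelines would replace `cosine_distance` with a distance in a learned sentence-embedding space, but the argmin selection over candidate images is the same.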
“…This is currently done using deep models that combine image embedding with the ability to generate descriptions using recurrent Long short-term memory (LSTM) neural layers such as [ 41 , 42 , 43 ]. A large survey on this field can be found at [ 44 , 45 ]. Thus, one could use such automatically generated image descriptions with sentences that have been generated by the text summarizer (see Section 2.1.1 ) to determine the similarity between the text and the various images we have, in order to select the image that minimises the given distance metric [ 46 ].…”
Section: Materials and Methodsmentioning
confidence: 99%
“…Proposition (1) in fact shows that in visual tracking the correlation filter and the convolution filter are equivalent in the sense of equal minimum mean-square error of estimation, under the condition that the ideal filter response is a 2-D centrosymmetric Gaussian function and the optimal solutions exist. More specifically:

* As long as the optimal solution of one of the filters is known, the optimal solution of the other can be obtained immediately;
* The mean-square errors of the filter responses of the two optimal solutions to a given detection sample are equal;
* The filter responses of the two optimal solutions to a given detection sample are symmetric about the origin.…”

Section: Proof of the Equivalence
confidence: 99%
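The symmetry in the last bullet follows from the standard discrete definitions of cross-correlation and convolution; a short derivation, in generic 1-D notation rather than the paper's 2-D setup:

```latex
% Cross-correlation and convolution of a signal f with a filter h:
(f \star h)(x) = \sum_t f(t)\, h(t + x), \qquad
(f * h)(x)    = \sum_t f(t)\, h(x - t).
% If h is centrosymmetric, h(u) = h(-u), then h(x - t) = h(t - x), hence
(f * h)(x) = \sum_t f(t)\, h(t - x) = (f \star h)(-x),
% i.e. the two responses coincide up to reflection about the origin.
```

A Gaussian centered at the origin satisfies the centrosymmetry condition, which is why the proposition's hypothesis on the ideal filter response makes the two formulations interchangeable.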
“…Two currently predominant approaches are discriminative correlation filter (DCF)-based methods and deep learning (DL)-based methods. Deep learning has been intensively studied and has demonstrated remarkable success across a wide range of computer vision areas, such as image classification, object detection, image captioning, and semantic segmentation [23, 35, 16, 48, 1]. Inspired by deep learning breakthroughs in these fields, DL-based methods have attracted considerable interest in the visual tracking community and have witnessed rapid development and great advances in recent years.…”
Section: Introduction
confidence: 99%
“…Many researchers have presented different models for video captioning [10, 25, 24, 17, 27], mostly with limited success and many constraints. Krishna et al. [17], however, presented dense-captioning of events in video [4], which focuses on detecting the multiple events that occur in a video by jointly localizing temporal proposals of interest and then describing each with natural language. This model introduced a new captioning module that uses contextual information from past and future events to describe all events jointly.…”
Section: Image and Video Captioning
confidence: 99%