2023
DOI: 10.1007/s10462-023-10414-6
Video description: A comprehensive survey of deep learning approaches

Abstract: Video description refers to understanding visual content and transforming that acquired understanding into automatic textual narration. It bridges the key AI fields of computer vision and natural language processing in conjunction with real-time and practical applications. Deep learning-based approaches employed for video description have demonstrated enhanced results compared to conventional approaches. The current literature lacks a thorough interpretation of the recently developed and employed sequence to s…
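To make the sequence-to-sequence framing of video description concrete, the sketch below pairs a frame-level encoder with a recurrent word decoder. It is a minimal illustration only: the module sizes, vocabulary size, and random tensors are assumptions, not the survey's reference implementation or any specific surveyed model.

```python
# Hedged sketch: encoder-decoder (sequence-to-sequence) video captioning.
# Pre-extracted frame features -> LSTM-encoded video state -> LSTM decoder over caption tokens.
import torch
import torch.nn as nn


class Seq2SeqCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        # frame_feats: (batch, num_frames, feat_dim) CNN features per frame
        # caption_tokens: (batch, caption_len) ground-truth prefix (teacher forcing)
        _, (h, c) = self.encoder(frame_feats)      # summarize the frame sequence
        emb = self.embed(caption_tokens)           # embed the caption prefix
        dec_out, _ = self.decoder(emb, (h, c))     # condition the decoder on the video state
        return self.out(dec_out)                   # (batch, caption_len, vocab_size) logits


# Illustrative forward pass with random tensors standing in for real features/tokens.
model = Seq2SeqCaptioner()
feats = torch.randn(2, 30, 2048)                   # 2 clips, 30 frames each
tokens = torch.randint(0, 10000, (2, 12))          # 12-token caption prefixes
logits = model(feats, tokens)
print(logits.shape)                                # torch.Size([2, 12, 10000])
```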

Cited by 17 publications (5 citation statements)
References: 181 publications

Citation statements:
“…This involved creating coherent and specific sentences by selecting keywords based on predefined templates. However, this approach had limitations in supporting complex event representation and understanding multiple scenes in long videos because it is restricted to simple sentence structures [12,13]. Video captioning approaches can be broadly divided into two types: template-based approaches and sequence-based approaches.…”
Section: Related Work (mentioning)
confidence: 99%
“…In order to present our results, we have performed a clear comparison of our proposed model with the selected methods available for the video captioning task. These methods include EEDVC [52], DCE [53], MFT [54], WLT [55], MDVC [12], EMVC [56], BMT [57], PPVC [38], and PDVC [35]. The performance analysis and evaluation of these methods are reported in Table 1.…”
Section: Comparison (mentioning)
confidence: 99%
“…This iterative process continues until the stopping criterion is met. L: the set of labeled instances; U: the set of unlabeled instances; φ(u_i): query strategy, where u_i ∈ U; B: number of instances to be selected at each iteration (batch size)…”
Section: Pool-based Active Learning (mentioning)
confidence: 99%
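The notation in the excerpt above (labeled set L, unlabeled pool U, query strategy φ, batch size B) corresponds to a standard pool-based loop. The sketch below is a minimal illustration assuming a scikit-learn-style classifier, least-confidence sampling as φ, a fixed query budget as the stopping criterion, and synthetic data standing in for a real pool and oracle; none of these choices are taken from the cited paper.

```python
# Hedged sketch of pool-based active learning with batch selection.
# L: labeled instances, U: unlabeled pool, phi: query strategy scoring each u in U,
# B: number of instances queried per iteration (batch size).
import numpy as np
from sklearn.linear_model import LogisticRegression


def least_confidence(model, X_pool):
    # phi(u): higher score means the model is less confident on u.
    probs = model.predict_proba(X_pool)
    return 1.0 - probs.max(axis=1)


def pool_based_active_learning(X_L, y_L, X_U, y_U_oracle, B=5, max_iters=10):
    model = LogisticRegression(max_iter=1000)
    for _ in range(max_iters):                  # stopping criterion: fixed query budget here
        if len(X_U) == 0:
            break
        model.fit(X_L, y_L)
        scores = least_confidence(model, X_U)   # apply phi to every u in U
        query_idx = np.argsort(scores)[-B:]     # select the B most informative instances
        # The oracle labels the queried instances; move them from U to L.
        X_L = np.vstack([X_L, X_U[query_idx]])
        y_L = np.concatenate([y_L, y_U_oracle[query_idx]])
        keep = np.setdiff1d(np.arange(len(X_U)), query_idx)
        X_U, y_U_oracle = X_U[keep], y_U_oracle[keep]
    model.fit(X_L, y_L)
    return model, X_L, y_L


# Toy usage: synthetic data, with held-back labels acting as the oracle.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
init = np.concatenate([np.where(y == 0)[0][:5], np.where(y == 1)[0][:5]])  # seed L with both classes
rest = np.setdiff1d(np.arange(len(y)), init)
model, X_L, y_L = pool_based_active_learning(X[init], y[init], X[rest], y[rest])
print(len(y_L))   # 10 seed labels plus B per completed iteration
```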
“…Dense video captioning involves visual understanding processes that locate different events in a video and generate descriptive captions for each interesting object. This approach represents video content in detail by transforming frame sequences into multiple descriptive sentences among multiple clips in a long video [26][27][28]. In the research of Shen et al [29], dense image captioning is migrated to the video field by combining the multi-scale suggestion module and a visual context perception mechanism.…”
Section: Related Work (mentioning)
confidence: 99%
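As a rough illustration of the two-stage structure described in the excerpt above (locate events in a long video, then generate a sentence for each), the sketch below wires a hypothetical event-proposal function to a per-segment captioner. The function names, segment format, and stub components are assumptions for illustration, not the architecture of any cited model.

```python
# Hedged sketch of a dense video captioning pipeline:
# (1) propose temporal event segments, (2) caption each segment independently.
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class EventCaption:
    start_sec: float
    end_sec: float
    sentence: str


def dense_caption(
    frames: Sequence,                                          # decoded frames or frame features
    propose_events: Callable[[Sequence], List[Tuple[int, int]]],  # -> [(start_idx, end_idx), ...]
    caption_segment: Callable[[Sequence], str],                # clip -> one descriptive sentence
    fps: float = 25.0,
) -> List[EventCaption]:
    results = []
    for start, end in propose_events(frames):
        clip = frames[start:end]                               # slice out one event
        results.append(EventCaption(start / fps, end / fps, caption_segment(clip)))
    return results


# Toy usage with stub components standing in for learned modules.
frames = list(range(250))                                      # pretend 10 s of video at 25 fps
proposals = lambda f: [(0, 100), (100, 250)]                   # stub event localizer
captioner = lambda clip: f"an event spanning {len(clip)} frames"  # stub captioner
for ev in dense_caption(frames, proposals, captioner):
    print(f"[{ev.start_sec:.1f}s - {ev.end_sec:.1f}s] {ev.sentence}")
```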
“…This work contrasted our proposed model with the state-of-the-art methods for the dense video captioning task, consisting of EEDVC [43], DCE [8], MFT [26], WLT [27], SDVC [9], EHVC [31], MDVC [28], BMT [52], EMVC [51], PPVC [13], and PDVC [46]. The contrast results are displayed in Table 1.…”
Section: Comparison to the State-of-the-Art (mentioning)
confidence: 99%