2023
DOI: 10.3390/math11173685

Parallel Dense Video Caption Generation with Multi-Modal Features

Xuefei Huang,
Ka-Hou Chan,
Wei Ke
et al.

Abstract: The task of dense video captioning is to generate detailed natural-language descriptions for an original video, which requires deep analysis and mining of semantic captions to identify events in the video. Existing methods typically follow a localisation-then-captioning sequence within given frame sequences, resulting in caption generation that is highly dependent on which objects have been detected. This work proposes a parallel-based dense video captioning method that can simultaneously address the mutual co…

Cited by 3 publications (1 citation statement) · References 60 publications
“…In other words, convolutional layers are used to extract spatial features from the frames, and the spatial features are sent to LSTM layers at each time step to model temporal sequences, as shown in Figure 5. In this way, the network learns spatial and temporal features immediately in an end-to-end training process, which makes the model more stable [22][23][24][25][26]. This means that most of the time, ϕ_v(v_t), the convolutional inference and training, can be completed in parallel over time.…”
Section: Long-Term Recurrent Convolutional Model
confidence: 99%
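The citing statement above describes the key structural property of the long-term recurrent convolutional model: the per-frame spatial extractor ϕ_v(v_t) has no dependence across time steps, so it can run in parallel over frames, while the LSTM recurrence must run sequentially. A minimal sketch of that split, with toy stand-ins for ϕ_v and the recurrent update (neither is the paper's actual model):

```python
from concurrent.futures import ThreadPoolExecutor

def phi_v(frame):
    # Hypothetical stand-in for the convolutional extractor:
    # a "spatial feature" computed from one frame in isolation.
    return sum(frame) / len(frame)

def rnn_step(hidden, feature):
    # Hypothetical stand-in for one LSTM step: an exponential
    # moving average, which depends on the previous hidden state.
    return 0.5 * hidden + 0.5 * feature

frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]

# phi_v(v_t) has no cross-time dependence, so frames can be
# processed in parallel...
with ThreadPoolExecutor() as pool:
    features = list(pool.map(phi_v, frames))

# ...while the recurrent part is inherently sequential over t.
hidden = 0.0
for f in features:
    hidden = rnn_step(hidden, f)

print(features)  # [0.5, 1.0, 3.0]
print(hidden)    # 1.8125
```

The same split is why end-to-end training of such models is practical: the expensive convolutional pass dominates the cost and parallelises across the time axis, leaving only the lightweight recurrence as a sequential bottleneck.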