2021
DOI: 10.1109/access.2021.3108565
Video Description: Datasets & Evaluation Metrics

Abstract: Rapid expansion and the novel phenomenon of deep learning have manifested a variety of proposals and concerns in the area of video description, particularly in the recent past. Automatic event localization and textual alternatives generation for the complex and diverse visual data supplied in a video can be articulated as video description, bridging the two leading realms of computer vision and natural language processing. Several sequence-to-sequence algorithms are being proposed by splitting the task into tw…
Cited by 12 publications (9 citation statements) | References 86 publications
“…Limitations associated with memory requirements while generating captions are addressed in EtENet-IRv2 (Olivastri 2019), which is also an end-to-end trainable ED architecture proposing a gradient-accumulation strategy, employing Inception-ResNet-v2 (Szegedy et al 2017) and GoogLeNet (Szegedy et al 2015) with two-stage training for encoding. Evaluation on benchmark datasets (Rafiq et al 2021) showed significant improvement, but with a limitation on the computational resources required for end-to-end training.…”
Section: Shortcomings
confidence: 99%
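The gradient-accumulation strategy mentioned above can be sketched in a few lines. This is a minimal, hypothetical illustration (a toy scalar model, not the EtENet-IRv2 implementation): instead of updating the weights after every micro-batch, gradients are summed over several micro-batches and applied in one step, so the effective batch size grows without increasing per-step memory.

```python
# Toy sketch of gradient accumulation (hypothetical scalar model,
# not the architecture from the cited work).

def grad(w, x, y):
    # Gradient of the squared error 0.5 * (w*x - y)^2 with respect to w.
    return (w * x - y) * x

def train(data, w=0.0, lr=0.1, accum_steps=4):
    acc, seen = 0.0, 0
    for x, y in data:
        acc += grad(w, x, y)   # accumulate instead of stepping immediately
        seen += 1
        if seen == accum_steps:
            # One parameter update per accumulation window,
            # averaging the stored gradients.
            w -= lr * acc / accum_steps
            acc, seen = 0.0, 0
    return w
```

The same idea carries over to deep-learning frameworks, where the optimizer step is simply deferred until the accumulated window is full.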
See 1 more Smart Citation
“…Limitations associated with memory requirements while generating captions is addressed in EtENet-IRv2 (Olivastri 2019), which is also an end-to-end trainable ED architecture proposing a gradient accumulating strategy employing Inception-ResNet-v2 (Szegedy et al 2017) and GoogLeNet (Szegedy et al 2015) with two-stage training for encoding. Evaluation of benchmark datasets (Rafiq et al 2021) showed significant improvement, but with a limitation on the computational resources required for end-to-end training.…”
Section: Shortcomingsmentioning
confidence: 99%
“…An evaluation metric is considered best when it exhibits a significant correlation with human scores (Zhang and Vogel 2010). A short description of the metrics most commonly used to evaluate automatically generated descriptions is given below; for the detailed computational concepts along with their limitations, please refer to Rafiq et al (2021).…”
Section: Evaluation Metrics
confidence: 99%
“…Various datasets have been launched from time to time to exhibit enhanced accomplishment of the video description task, exploring a wide range of constrained and open domains such as cooking by [11], [12], [13], [14], [15], and [16], human activities by [9], [23], [24], [8], and [25], social media by [20] and [19], movies by [17] and [18], TV shows by [21], and e-commerce by [22], presented in detail by [30]. Table 1 gives a brief overview of the key attributes and major statistics of existing multi-caption (dense/paragraph-like) video description datasets.…”
Section: A Video Description Datasets
confidence: 99%
“…Here, the spatial features (e_r) and motion features (o_r) are merged together and form y_r as Eqn. (7),…”
Section: Decoder
confidence: 99%
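The fusion step described above can be illustrated with a toy snippet. Note that Eqn. (7) of the cited work is not reproduced here; concatenation is assumed as one common fusion choice, and the variable names and values are purely illustrative.

```python
def fuse(spatial, motion):
    # Hypothetical fusion: concatenate per-frame spatial features e_r
    # with motion features o_r to form the decoder input y_r.
    # (Concatenation is one common choice; the cited work's Eqn. (7)
    # may define a different merge.)
    return spatial + motion  # list concatenation = [e_r ; o_r]

e_r = [0.1, 0.2, 0.3]   # e.g., CNN appearance features (toy values)
o_r = [0.4, 0.5]        # e.g., optical-flow/motion features (toy values)
y_r = fuse(e_r, o_r)
```

In practice the concatenated vector would be projected or fed directly into the decoder at each step.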
“…Spatio-temporal relationships [6] must be extracted to capture the temporal details of the input video. Video captioning is considered a demanding task for several major reasons [7]: its complex nature, temporal dependencies, interconnectivity, and diversified objects, events, scenes, and actions. The video captioning task is designed to generate captions in SVO (subject-verb-object) pattern [8,9,10,11] describing the visuals of the input video.…”
Section: Introduction
confidence: 99%