2023
DOI: 10.1007/s40747-023-00998-5

Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Abstract: Dense video captioning (DVC) aims to generate a description for each scene in a video. Despite encouraging progress on this task, previous works usually concentrate only on exploiting visual features and neglect the audio information in the video, which results in inaccurate localization of scene events. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode …

Cited by 14 publications
References 42 publications