2021
DOI: 10.48550/arxiv.2105.04489
Preprint

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Abstract: When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest…
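The abstract describes learning a joint embedding space in which videos and their spoken descriptions are aligned. As a rough illustration of this family of methods, the sketch below implements a generic symmetric contrastive (InfoNCE-style) alignment loss in PyTorch. It is an assumption-based sketch, not the paper's own objective (the full text introduces its own contrastive formulation), and every name in it (contrastive_alignment_loss, video_emb, text_emb, temperature) is hypothetical.

    # Minimal sketch of a symmetric contrastive alignment loss.
    # Illustrative only; not the method proposed in this paper.
    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(video_emb: torch.Tensor,
                                   text_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
        """Symmetric cross-entropy over cosine similarities of paired embeddings.

        video_emb, text_emb: (batch, dim) outputs of a video encoder and a
        caption/transcript encoder; row i of each is assumed to describe the
        same clip.
        """
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.T / temperature  # (batch, batch) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        # Matched (video_i, description_i) pairs are positives; all other
        # pairs in the batch serve as negatives, so clips and the
        # descriptions people give them are pulled together in a joint space.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

In practice video_emb and text_emb would come from separate encoders over the clip and its spoken caption; the temperature of 0.07 is a common default in contrastive learning, not a value taken from this paper.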

Cited by 1 publication
References 57 publications (118 reference statements)