2022
DOI: 10.1007/978-3-031-19830-4_24

EclipSE: Efficient Long-Range Video Retrieval Using Sight and Sound

citations
Cited by 17 publications
(3 citation statements)
references
References 56 publications
“…With large scale and realistic datasets, e.g., AudioSet [21], advanced network architectures have been adopted for audio classification including convolutional neural networks [23,24], convolutional-attention networks [25,26], and recent pure-attention based networks [1,27]. Particularly, AST [1] outperforms previous state-of-the-art audio classification approaches, and obtains widely adoption in many tasks, e.g., multimodal event classification [28,29] and video retrieval [30]. CMKD [31] further designs cross-modal knowledge distillation between convolutional networks and AST for audio classification.…”
Section: Related Work
mentioning
confidence: 99%
“…[7] introduces structured multi-scale temporal decoder for self-attention. [18] suggests replacing the corresponding video with information-rich audio. They all work to reduce the computational effort of long video modeling to improve the performance.…”
Section: Related Work
mentioning
confidence: 99%
“…As acoustic measurements are unaffected by illumination changes and occlusions, they are a reliable solution for round-the-clock 24 hours monitoring. In addition, acoustic data is typically more compact and cost-effective to process compared to raw video data [20], [21]. Cui et al [22] introduced an audio dataset AFFIA3K for FFIA consisting of 3000 labelled audio clips and demonstrated the practicality of audio-based FFIA.…”
mentioning
confidence: 99%