2021
DOI: 10.1007/978-3-030-69541-5_25

Interpreting Video Features: A Comparison of 3D Convolutional Networks and Convolutional LSTM Networks

Abstract: A number of techniques for interpretability have been presented for deep learning in computer vision, typically with the goal of understanding what the networks have actually learned underneath a given classification decision. However, interpretability for deep video architectures is still in its infancy and we do not yet have a clear concept of how to decode spatiotemporal features. In this paper, we present a study comparing how 3D convolutional networks and convolutional LSTM networks learn features across …
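For orientation, the two model families compared in the paper can be sketched roughly as follows. This is a minimal illustration assuming PyTorch, not the authors' architectures: layer sizes are invented, the 174-class output is borrowed from the Something-Something v2 statement further below, and the recurrent branch uses a plain LSTM over pooled frame features as a stand-in for a full ConvLSTM cell.

```python
# Minimal sketch (not the authors' exact models): the two video-feature
# families compared in the paper.
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    """3D CNN: convolves jointly over (time, height, width)."""
    def __init__(self, num_classes=174):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),  # (B, 3, T, H, W) -> (B, 32, T, H, W)
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # global spatiotemporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip):                             # clip: (B, 3, T, H, W)
        return self.classifier(self.features(clip).flatten(1))

class TinyRecurrentVideoNet(nn.Module):
    """Recurrent family: a 2D CNN per frame, recurrence over time.
    A plain LSTM over pooled frame features stands in for a ConvLSTM cell."""
    def __init__(self, num_classes=174):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                             # clip: (B, 3, T, H, W)
        b, c, t, h, w = clip.shape
        frames = clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        feats = self.frame_encoder(frames).flatten(1).view(b, t, -1)
        _, (h_n, _) = self.rnn(feats)
        return self.classifier(h_n[-1])                  # classify from the last hidden state
```

The key contrast is that the 3D CNN entangles time and space in a single kernel, while the recurrent model keeps per-frame spatial encoding separate from temporal integration.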

Cited by 21 publications (11 citation statements) | References 18 publications

Citation statements (ordered by relevance):
“…Experimental Setup. Our experiments are conducted on the validation set of the large-scale Something-Something v2 dataset [10], frequently used to probe video models [1,29,33] due to its fine-grained nature, large number of classes (174), and the temporal characteristics of most of its classes e.g. "Pulling [...] from behind of [...]".…”
Section: Methods
confidence: 99%
“…Evaluating a model on feature subsets is challenging as models rarely support the notion of a 'missing feature'. Two approaches exist: re-training the model on all combinations of features [28] or substituting missing features with those from a reference [14,16,22,24,25,29,30], but both approaches have limitations. Retraining is computationally infeasible for more than a handful of features, and the choice of reference in feature substitution has a significant impact on the resulting attribution values [18,31].…”
Section: Element Attribution In Variable-length Sequences
confidence: 99%
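A minimal sketch of the "substitute missing features with a reference" strategy mentioned in the statement above, assuming PyTorch and frame-level substitution; the zero baseline and the helper name are illustrative choices, not the cited method:

```python
# Occlusion-style attribution by substituting each frame with a reference value.
import torch

@torch.no_grad()
def frame_substitution_attribution(model, clip, target_class, reference=None):
    """Attribute a video prediction to individual frames by substitution.

    clip: tensor of shape (1, C, T, H, W)
    reference: tensor broadcastable to one frame (C, H, W); defaults to zeros.
    Returns a length-T tensor: drop in the target-class score when each frame
    is replaced by the reference.
    """
    model.eval()
    if reference is None:
        reference = torch.zeros_like(clip[:, :, 0])      # zero baseline, one frame
    base_score = model(clip)[0, target_class]

    num_frames = clip.shape[2]
    drops = torch.zeros(num_frames)
    for t in range(num_frames):
        perturbed = clip.clone()
        perturbed[:, :, t] = reference                   # "remove" frame t
        drops[t] = (base_score - model(perturbed)[0, target_class]).item()
    return drops
```

Swapping the zero reference for, say, a dataset-mean frame can change the resulting attributions noticeably, which is the reference-choice sensitivity the statement points to.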
“…This class of RNN is well suited to long sequences. This is why LSTMs have been applied across many domains, including medical diagnosis, and are frequently used in video classification and human activity recognition [92,93,94,95].…”
Section: Theoretical Background
confidence: 99%