2021
DOI: 10.1007/978-3-030-69541-5_25

Interpreting Video Features: A Comparison of 3D Convolutional Networks and Convolutional LSTM Networks

Abstract: A number of techniques for interpretability have been presented for deep learning in computer vision, typically with the goal of understanding what the networks have actually learned underneath a given classification decision. However, interpretability for deep video architectures is still in its infancy and we do not yet have a clear concept of how to decode spatiotemporal features. In this paper, we present a study comparing how 3D convolutional networks and convolutional LSTM networks learn features across …
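For orientation, the two model families compared in the paper can be sketched roughly as follows. This is a minimal illustration assuming PyTorch, not the authors' architectures: layer sizes are invented, the 174-class output is borrowed from the Something-Something v2 statement further below, and the recurrent branch uses a plain LSTM over pooled frame features as a stand-in for a full ConvLSTM cell.

```python
# Minimal sketch (not the authors' exact models): the two video-feature
# families compared in the paper.
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    """3D CNN: convolves jointly over (time, height, width)."""
    def __init__(self, num_classes=174):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),  # (B, 3, T, H, W) -> (B, 32, T, H, W)
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # global spatiotemporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip):                             # clip: (B, 3, T, H, W)
        return self.classifier(self.features(clip).flatten(1))

class TinyRecurrentVideoNet(nn.Module):
    """Recurrent family: a 2D CNN per frame, recurrence over time.
    A plain LSTM over pooled frame features stands in for a ConvLSTM cell."""
    def __init__(self, num_classes=174):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                             # clip: (B, 3, T, H, W)
        b, c, t, h, w = clip.shape
        frames = clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        feats = self.frame_encoder(frames).flatten(1).view(b, t, -1)
        _, (h_n, _) = self.rnn(feats)
        return self.classifier(h_n[-1])                  # classify from the last hidden state
```

The key contrast is that the 3D CNN entangles time and space in a single kernel, while the recurrent model keeps per-frame spatial encoding separate from temporal integration.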

Cited by 21 publications (11 citation statements) | References 18 publications

Citation statements (ordered by relevance):
“…Experimental Setup. Our experiments are conducted on the validation set of the large-scale Something-Something v2 dataset [10], frequently used to probe video models [1,29,33] due to its fine-grained nature, large number of classes (174), and the temporal characteristics of most of its classes e.g. "Pulling [...] from behind of [...]".…”
Section: Methods
confidence: 99%
“…Evaluating a model on feature subsets is challenging as models rarely support the notion of a 'missing feature'. Two approaches exist: re-training the model on all combinations of features [28] or substituting missing features with those from a reference [14,16,22,24,25,29,30], but both approaches have limitations. Retraining is computationally infeasible for more than a handful of features, and the choice of reference in feature substitution has a significant impact on the resulting attribution values [18,31].…”
Section: Element Attribution In Variable-length Sequences
confidence: 99%
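A minimal sketch of the "substitute missing features with a reference" strategy mentioned in the statement above, assuming PyTorch and frame-level substitution; the zero baseline and the helper name are illustrative choices, not the cited method:

```python
# Occlusion-style attribution by substituting each frame with a reference value.
import torch

@torch.no_grad()
def frame_substitution_attribution(model, clip, target_class, reference=None):
    """Attribute a video prediction to individual frames by substitution.

    clip: tensor of shape (1, C, T, H, W)
    reference: tensor broadcastable to one frame (C, H, W); defaults to zeros.
    Returns a length-T tensor: drop in the target-class score when each frame
    is replaced by the reference.
    """
    model.eval()
    if reference is None:
        reference = torch.zeros_like(clip[:, :, 0])      # zero baseline, one frame
    base_score = model(clip)[0, target_class]

    num_frames = clip.shape[2]
    drops = torch.zeros(num_frames)
    for t in range(num_frames):
        perturbed = clip.clone()
        perturbed[:, :, t] = reference                   # "remove" frame t
        drops[t] = (base_score - model(perturbed)[0, target_class]).item()
    return drops
```

Swapping the zero reference for, say, a dataset-mean frame can change the resulting attributions noticeably, which is the reference-choice sensitivity the statement points to.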
“…This class of RNN is well suited to long sequences. This is why LSTMs have been applied across many domains, including medical diagnosis, and are frequently used in video classification and human activity recognition [92,93,94,95].…”
Section: Theoretical Background
confidence: 99%