2019
DOI: 10.48550/arxiv.1911.00232
Preprint

Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

Abstract: An event happening in the world is often made of different activities and actions that can unfold simultaneously or sequentially within a few seconds. However, most large-scale datasets built to train models for action recognition provide a single label per video clip. Consequently, models can be incorrectly penalized for classifying actions that exist in the videos but are not explicitly labeled, and they do not learn the full spectrum of information that would be necessary to more completely comprehend different events…

Cited by 3 publications (13 citation statements) | References 35 publications
“…Efficient Action Recognition. Action recognition has made rapid progress with the introduction of a number of large-scale datasets such as Kinetics [7] and Moments-In-Time [39,40]. Early methods have studied action recognition using shallow classification models such as SVM on top of local visual features extracted from a video [34,53].…”
Section: Related Work
Mentioning confidence: 99%
“…[36,83,325]. Unlike these surveys, we will use a broader definition of action, one that includes actions of both human and non-human actors because (1) video datasets are being introduced that use this broader definition [177,178], (2) most deep learning metrics and methods are equally applicable to both settings, and (3) the colloquial use of action has no distinction between human and non-human actors. Merriam-Webster's Dictionary and the Oxford English Dictionary define action as "an act done" and "something done or performed", respectively [174,197].…”
Section: Data Diversity, Robustness, Transferability, Performance Unders…
Mentioning confidence: 99%
“…Each single-instance video lasts for 3 seconds. The dataset was improved to Multi-Moments in Time (M-MiT) [178] in 2019 by increasing the number of videos to 1.02 million, pruning vague classes, and increasing the number of labels per video (2.01 million total labels). MiT and M-MiT are interesting benchmarks because of the focus on inter-class and intra-class variation.…”
Section: Action Recognition Datasets Table
Mentioning confidence: 99%
“…These sequential events present additional challenges for video datasets where single annotations may not be sufficient to explain the events depicted. Multi-label approaches to video annotation have attempted to address this problem by labeling multiple actions in a video [47,22,73]. However, these methods focus on single domain annotations, such as actions or objects, and do not capture additional contextual information, such as "person angrily putting down the dirty glass on a rusted table", which can change the interpretation of an event and how it fits into a sequence of observations.…”
Section: Introduction
Mentioning confidence: 99%