Proceedings of the 2020 International Conference on Multimedia Retrieval
DOI: 10.1145/3372278.3390675
Heterogeneous Non-Local Fusion for Multimodal Activity Recognition

Abstract: In this work, we investigate activity recognition using multimodal inputs from heterogeneous sensors. Activity recognition is commonly tackled from a single-modal perspective using videos. When multiple signals are used, they typically come from the same homogeneous modality, e.g. color and optical flow. Here, we propose an activity network that fuses multimodal inputs coming from completely different and heterogeneous sensors. We frame such a heterogeneous fusion as a non-local operation. The observat…
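The abstract frames heterogeneous fusion as a non-local operation: every position in one modality can attend to every position in another. Below is a minimal NumPy sketch of such a cross-modal non-local block, assuming per-frame video features and per-timestep sensor features; the projection matrices would be learned in the actual model and are random here purely for illustration.

```python
import numpy as np

def nonlocal_fusion(video_feats, sensor_feats, dim=16, seed=0):
    """Hypothetical cross-modal non-local block (illustrative sketch).

    video_feats:  (T_v, D_v) per-frame video features
    sensor_feats: (T_s, D_s) per-timestep sensor features

    Each video position attends to every sensor position, so a cue
    anywhere in the sensor stream can inform any video frame
    (non-locality across both time and modality).
    """
    rng = np.random.default_rng(seed)
    # Learned linear projections in the real model; random stand-ins here.
    W_q = rng.standard_normal((video_feats.shape[1], dim)) / np.sqrt(dim)
    W_k = rng.standard_normal((sensor_feats.shape[1], dim)) / np.sqrt(dim)
    W_v = rng.standard_normal((sensor_feats.shape[1], video_feats.shape[1]))

    q = video_feats @ W_q    # (T_v, dim)  queries from video
    k = sensor_feats @ W_k   # (T_s, dim)  keys from sensors
    v = sensor_feats @ W_v   # (T_s, D_v)  values from sensors

    # Pairwise affinities f(x_i, y_j), softmax-normalized over sensor steps.
    logits = q @ k.T / np.sqrt(dim)                       # (T_v, T_s)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)

    # Residual connection, as in standard non-local blocks.
    return video_feats + attn @ v

video = np.random.default_rng(1).standard_normal((8, 32))   # 8 frames
sensor = np.random.default_rng(2).standard_normal((20, 6))  # 20 sensor steps
fused = nonlocal_fusion(video, sensor)
print(fused.shape)  # (8, 32): fused features keep the video shape
```

The residual form means the block can fall back to the unimodal video representation when the sensor stream is uninformative, mirroring how standard non-local blocks are inserted into existing backbones.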

Cited by 6 publications (4 citation statements)
References 52 publications
“…Combining video inputs with smart glove data in a non-local manner allows us to recognize activities, even when they are not visible on camera. [11] of interest in the beginning of a video-sensor clip and could recognize the motion-specific sensor reading at the end of the video-sensor clip, achieving temporal non-locality. As seen in Figure 3.2, two modalities, smart gloves and videos, can provide information hidden from the other modality.…”
Section: Combining Video and Wearable Sensor Modalities
confidence: 99%
“…[48] for textual sequence-to-sequence tasks and recently adopted in video tasks [4,14,15]. Here, we investigate their potential for interaction detection.…”
Section: Relation Tagging
confidence: 99%
“…At present, multimedia retrieval needs to obtain the content needed by users from very rich video, picture, and text data [2,4-6,24]. The image classification model is an important part of the multimedia retrieval task.…”
Section: Introduction
confidence: 99%