2022
DOI: 10.48550/arxiv.2202.09124
Preprint

Multi-view and Multi-modal Event Detection Utilizing Transformer-based Multi-sensor fusion

Abstract: We tackle a challenging task: multi-view and multi-modal event detection that detects events in a wide-range real environment by utilizing data from distributed cameras and microphones and their weak labels. In this task, distributed sensors are utilized complementarily to capture events that are difficult to capture with a single sensor, such as a series of actions of people moving in an intricate room, or communication between people located far apart in a room. For sensors to cooperate effectively in such a…

Cited by 1 publication (2 citation statements)
References 29 publications
“…Subsequently, it was applied to various computer vision tasks such as classification [18], [19] and object detection [20]. In addition, the self-attention mechanism of the transformer was used in sensor fusion methods for object detection [21], and in multi-modal representation learning [22] by taking multi-modal tokens as input. A recent method for autonomous driving [23] combines the global context of RGB and LiDAR scenes by applying the self-attention on convoluted features in order to handle multimodal data well.…”
(mentioning, confidence: 99%)
“…ViViT can be used as an architecture to model long-range spatiotemporal context for the video-sequence problem of extracting 1D time-series signals from 3D videos. In addition, the self-attention mechanism of the transformer has the advantage of effectively fusing features of various modalities by automatically highlighting important parts, as is widely used in sensor fusion and multi-modal representation learning [21]–[23]. Therefore, we consider that the transformer-based ViViT could not only combine the spatiotemporal information of the different RGB and NIR modalities well, but also be suitable for video-sequence problems by exploiting long-range contextual clues.…”
(mentioning, confidence: 99%)
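Neither the paper's architecture nor its hyper-parameters appear on this page, but the fusion idea quoted in both citation statements can be illustrated with a minimal PyTorch sketch: per-modality features are projected into a shared token space, tagged with a learned modality embedding, and passed jointly through a transformer encoder so that self-attention can highlight the important parts of each modality. All layer names, feature sizes, and the event count below are illustrative assumptions, not details taken from the cited works.

```python
# Minimal sketch of transformer-based multi-modal token fusion.
# Feature dimensions, number of events, and pooling are assumptions for illustration.
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_events=10):
        super().__init__()
        # Project each modality into the shared token dimension (sizes assumed).
        self.audio_proj = nn.Linear(128, d_model)
        self.video_proj = nn.Linear(512, d_model)
        # Learned modality embeddings so attention can distinguish token sources.
        self.modality_emb = nn.Parameter(torch.zeros(2, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_events)

    def forward(self, audio_tokens, video_tokens):
        # audio_tokens: (batch, T_audio, 128), video_tokens: (batch, T_video, 512)
        a = self.audio_proj(audio_tokens) + self.modality_emb[0]
        v = self.video_proj(video_tokens) + self.modality_emb[1]
        tokens = torch.cat([a, v], dim=1)   # one joint token sequence
        fused = self.encoder(tokens)        # self-attention across both modalities
        return self.head(fused.mean(dim=1)) # clip-level event logits (placeholder pooling)

logits = MultiModalFusion()(torch.randn(2, 50, 128), torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 10])
```

How the fused tokens are pooled and how the weak labels mentioned in the abstract are handled are design choices this page does not settle; the mean-pooling head above is only a placeholder.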