MuMu: Cooperative Multitask Learning-Based Guided Multimodal Fusion

Islam, Md Mofijul; Iqbal, Tariq

doi:10.1609/aaai.v36i1.19988

Cited by 20 publications

(3 citation statements)

References 46 publications


“…Using our VCMA dataset, the highest average accuracy of 99.81% is achieved by the augmented key point data and sensor data. The highest accuracy of the UTD-MHAD dataset is 97.6% (Islam 2022), and the highest accuracy of the Berkeley MHAD dataset is 99.6% (Ahmad and Khan 2020) using depth and inertial sensor data. Although the modalities used are different, they are all multimodal action recognition with sensor data.…”

Section: Discussion (mentioning)

confidence: 99%

Action recognition based on multimode fusion for VR online platform

Chen et al. 2023

Virtual Reality


The current popular online communication platforms can convey information only in the form of text, voice, pictures, and other electronic means. The richness and reliability of information is not comparable to traditional face-to-face communication. The use of virtual reality (VR) technology for online communication is a viable alternative to face-to-face communication. In the current VR online communication platform, users are in a virtual world in the form of avatars, which can achieve “face-to-face” communication to a certain extent. However, the actions of the avatar do not follow the user, which makes the communication process less realistic. Decision-makers need to make decisions based on the behavior of VR users, but there are no effective methods for action data collection in VR environments. In our work, three modalities of nine actions from VR users are collected using a virtual reality head-mounted display (VR HMD) built-in sensors, RGB cameras and human pose estimation. Using these data and advanced multimodal fusion action recognition networks, we obtained a high accuracy action recognition model. In addition, we take advantage of the VR HMD to collect 3D position data and design a 2D key point augmentation scheme for VR users. Using the augmented 2D key point data and VR HMD sensor data, we can train action recognition models with high accuracy and strong stability. In data collection and experimental work, we focus our research on classroom scenes, and the results can be extended to other scenes.


“…Moreover, hybrid fusion techniques have also been explored, combining feature-level and decision-level fusion approaches [17][18][19]. These techniques aim to leverage the benefits of both strategies by fusing low-level sensory features and high-level decision outputs.…”

Section: Related Work 2.1 Multimodal Fusion (mentioning)

confidence: 99%

Multi-Modal Representation via Contrastive Learning with Attention Bottleneck Fusion and Attentive Statistics Features

Guo, Liao, et al. 2023

Entropy


The integration of information from multiple modalities is a highly active area of research. Previous techniques have predominantly focused on fusing shallow features or high-level representations generated by deep unimodal networks, which only capture a subset of the hierarchical relationships across modalities. However, previous methods are often limited to exploiting the fine-grained statistical features inherent in multimodal data. This paper proposes an approach that densely integrates representations by computing image features’ means and standard deviations. The global statistics of features afford a holistic perspective, capturing the overarching distribution and trends inherent in the data, thereby facilitating enhanced comprehension and characterization of multimodal data. We also leverage a Transformer-based fusion encoder to effectively capture global variations in multimodal features. To further enhance the learning process, we incorporate a contrastive loss function that encourages the discovery of shared information across different modalities. To validate the effectiveness of our approach, we conduct experiments on three widely used multimodal sentiment analysis datasets. The results demonstrate the efficacy of our proposed method, achieving significant performance improvements compared to existing approaches.


“…Although these modalities also bring their own limitations and challenges, especially for some real-world applications, they have a huge potential to contribute to activity recognition performance as well as to unlock the full capabilities of the skeleton and inertial modalities. Besides, future work could benefit from using more advanced and sophisticated encoder architectures that may enable additional cross-modal fusion strategies [94,95]. While different supervised and SSL methods have been adapted to HAR in research studies and this dissertation, in particular, there is a limited amount of studies that are focused on analyzing representations produced by these algorithms.…”

Section: Power of Multimodality in SSL (mentioning)

confidence: 99%

Feature representation learning for human activity recognition

Khaertdinov
