Personality recognition plays an important role in deepening our understanding of social relations. While personality recognition methods have made significant strides in recent years, the heterogeneity between modalities during feature fusion remains an open challenge. This paper introduces an adaptive multi-modal information fusion network (AMIF-Net) capable of concurrently processing video, audio, and text data. First, the AMIF-Net encoder processes the extracted audio and video features separately, effectively capturing long-term dependencies in the data. Then, we add adaptive elements to the fusion network to alleviate the heterogeneity between modalities. Lastly, we feed the concatenated audio-video and text features into a regression network to obtain Big Five personality trait scores. Furthermore, we introduce a novel loss function to address training inaccuracies, exploiting its property of peaking at the critical mean. Experiments on the ChaLearn First Impressions V2 multi-modal dataset show that our model surpasses state-of-the-art networks on part of the evaluation metrics.
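As a rough illustration of the adaptive fusion step described above, the sketch below projects each modality into a shared space and weights it with a learned gate before concatenation with text features for regression. The module structure, gating scheme, and feature dimensions are assumptions made for illustration, not AMIF-Net's actual architecture.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Illustrative adaptive audio-video fusion: learn per-modality gates so
    heterogeneous features contribute on a comparable scale. The layer sizes
    and gating design are assumptions, not the paper's exact method."""

    def __init__(self, audio_dim: int, video_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Project each modality into a shared space to reduce heterogeneity.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        # Adaptive element: a gate computed from both modalities decides
        # how much each contributes to the fused representation.
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, 2), nn.Softmax(dim=-1))

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.audio_proj(audio))
        v = torch.tanh(self.video_proj(video))
        w = self.gate(torch.cat([a, v], dim=-1))  # (batch, 2) modality weights
        return w[:, :1] * a + w[:, 1:] * v        # adaptively weighted fusion

# Example: fuse audio/video clip features, then regress Big Five scores in [0, 1].
fusion = AdaptiveFusion(audio_dim=68, video_dim=512)          # dims are hypothetical
regressor = nn.Sequential(nn.Linear(128 + 768, 5), nn.Sigmoid())  # +768 text features
audio, video, text = torch.randn(4, 68), torch.randn(4, 512), torch.randn(4, 768)
scores = regressor(torch.cat([fusion(audio, video), text], dim=-1))  # (4, 5) traits
```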