Multi-Head multimodal deep interest recommendation network

Yang, Mingbao; Zhou, Peng; Li, Shaobo; Zhang, Yuanmeng; Hu, Jianjun; Zhang, Ansi

doi:10.1016/j.knosys.2023.110689

Cited by 2 publications

(2 citation statements)

References 13 publications


Section: The Proposed Model (mentioning)

confidence: 99%

“…The developed model uses a joint embedding space to represent the input signals, which allows the model to learn the relationships between different text and image modalities. The joint embedding space is created by combining the image and text hidden states or feature vectors, learned by Transformer encoders, using a multimodal projection head [15,20]. A semantically hierarchical common space is defined to account for the granularity of different modalities, and the contrastive loss method is employed to train the model.…”
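The mechanism described in this statement (projecting image and text features into a joint space and training with a contrastive loss) can be illustrated with a minimal pure-Python sketch. Everything below is an assumption made for illustration: the quoted text does not give the exact form of the loss, so a symmetric InfoNCE-style loss with a temperature of 0.07 is swapped in; the function names `project`, `l2_normalize`, and `contrastive_loss` are hypothetical; and the semantically hierarchical common space is not modeled.

```python
import math


def project(features, weights):
    """Linear projection head (assumed form): weights[j] holds the input
    weights for output dimension j of the joint space."""
    return [[sum(f * w for f, w in zip(vec, col)) for col in weights]
            for vec in features]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumption, not the cited paper's
    confirmed loss). Row i of image_emb and row i of text_emb form the
    positive pair; every other row in the batch acts as a negative."""
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # Mean cross-entropy with the diagonal entry as the target class,
        # using the log-sum-exp trick for numerical stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_image))


# Toy usage: 3-dimensional "encoder outputs" projected into a 2-dimensional
# joint space with hand-picked, hypothetical weights.
image_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_features = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
head = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss = contrastive_loss(project(image_features, head),
                        project(text_features, head))
```

In this form the loss is small when each image embedding is closest to its own text embedding and grows when pairs are mismatched, which is the property a joint embedding space relies on.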


A Multimodal Transformer Model for Recognition of Images from Complex Laparoscopic Surgical Videos

Abiyev, Altabel, Darwish, et al. 2024

Diagnostics


The determination of the potential role and advantages of artificial intelligence-based models in the field of surgery remains uncertain. This research marks an initial stride towards creating a multimodal model, inspired by the Video-Audio-Text Transformer, that aims to reduce negative occurrences and enhance patient safety. The model employs text and image embedding state-of-the-art models (ViT and BERT) to assess their efficacy in extracting the hidden and distinct features from the surgery video frames. These features are then used as inputs for convolution-free Transformer architectures to extract comprehensive multidimensional representations. A joint space is then used to combine the text and image features extracted from both Transformer encoders. This joint space ensures that the relationships between the different modalities are preserved during the combination process. The entire model was trained and tested on laparoscopic cholecystectomy (LC) videos encompassing various levels of complexity. Experimentally, a mean accuracy of 91.0%, a precision of 81%, and a recall of 83% were reached by the model when tested on 30 videos out of 80 from the Cholec80 dataset.
