“…The self-attention mechanism of transformers provides a natural bridge to connect multimodal signals. Applications include audio enhancement [17,63], speech recognition [26], image segmentation [63,73], cross-modal sequence generation [21,37,38], video retrieval [20], and image and video captioning and classification [28,29,36,44,60,61]. A common paradigm (which we also adopt) is to use the output representations of single-modality convolutional networks as inputs to the transformer [20,35].…”
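
The paradigm described in the last sentence of the excerpt can be illustrated with a minimal NumPy sketch. It is not the implementation of any cited work: the random arrays stand in for the outputs of per-modality convolutional networks, and every name, shape, and weight below (`audio_feats`, `video_feats`, `d_model`, the projection matrices) is a hypothetical chosen for illustration. Each modality's features are projected to a shared width, concatenated into one token sequence, and passed through a single-head self-attention layer, so that every token can attend to tokens of both modalities.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (n_tokens, d) array."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
d_model = 16  # assumed shared token width

# Stand-ins for CNN outputs (assumption: random values, arbitrary shapes).
audio_feats = rng.normal(size=(5, 32))  # 5 audio tokens, 32-dim each
video_feats = rng.normal(size=(8, 64))  # 8 video tokens, 64-dim each

# Hypothetical linear projections into the shared width.
proj_audio = rng.normal(size=(32, d_model)) / np.sqrt(32)
proj_video = rng.normal(size=(64, d_model)) / np.sqrt(64)

# One joint sequence: audio tokens first, then video tokens.
tokens = np.concatenate(
    [audio_feats @ proj_audio, video_feats @ proj_video], axis=0
)

# Randomly initialised attention weights (untrained, for shape checking only).
w_q, w_k, w_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

fused, weights = self_attention(tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In the attention matrix `weights`, the block `weights[:5, 5:]` holds the attention that audio tokens pay to video tokens, which is where the cross-modal exchange happens in this sketch. A practical model would add positional or modality embeddings, multiple heads, and several layers.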