Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.438

On Vision Features in Multimodal Machine Translation

Abstract: Previous work on multimodal machine translation (MMT) has focused on ways of incorporating vision features into translation, but little attention has been paid to the quality of the vision models themselves. In this work, we investigate the impact of vision models on MMT. Given that Transformers are becoming popular in computer vision, we experiment with various strong models (such as the Vision Transformer) and enhanced features (such as object detection and image captioning). We develop a selective attention model to study the p…

Cited by 22 publications (26 citation statements); references 8 publications.
“…Gated fusion techniques are widely used to combine the representations from different modalities, as is done in some previous works [9]. In this method, for any input sample consisting of an image I, source text S, and target text T, the image features are obtained from OpenAI CLIP's Vision Transformer (ViT) model [14] as ViT(I), and the textual embeddings are obtained from the standard Transformer encoder as H_S.…”
Section: Gated Fusion Methods
confidence: 99%
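The gated fusion described in the statement above can be illustrated with a minimal sketch. The scalar gate, the weight vector `w`, and the bias `b` below are illustrative assumptions, not the cited papers' exact parameterization: in practice the gate is learned jointly with the translation model, and the text and image features come from a Transformer encoder and CLIP's ViT rather than random vectors.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(h_text, h_img, w, b):
    """Fuse a text state and an image feature with a scalar gate.

    lam    = sigmoid(w . [h_text; h_img] + b)  -- how much vision to admit
    h_fuse = h_text + lam * h_img              -- gated residual combination
    """
    concat = h_text + h_img  # list concatenation plays the role of [h_text; h_img]
    lam = sigmoid(sum(wi * xi for wi, xi in zip(w, concat)) + b)
    fused = [t + lam * v for t, v in zip(h_text, h_img)]
    return fused, lam

d = 4  # toy feature dimension (real models use hundreds of dimensions)
h_text = [random.uniform(-1, 1) for _ in range(d)]  # stand-in for H_S
h_img = [random.uniform(-1, 1) for _ in range(d)]   # stand-in for ViT(I)
w = [random.uniform(-0.1, 0.1) for _ in range(2 * d)]
fused, lam = gated_fusion(h_text, h_img, w, 0.0)
print(lam, fused)
```

Because the gate is a sigmoid, `lam` always lies in (0, 1), so the model can smoothly interpolate between ignoring the image (`lam` near 0) and fully adding it (`lam` near 1).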
“…This method was implemented based on Multimodal Machine Translation, where the sigmoid gate function was replaced with tanh. All parameters were kept constant as in [9], except for the learning rate, which was changed to 0.001, and the maximum number of updates, which was set to 800,000. For evaluation, the average of the last 10 checkpoints was used for more reliable results.…”
Section: Gated Fusion Methods
confidence: 99%
“…Then, we apply the gated fusion mechanism (Zhang et al., 2020; Wu et al., 2021; Li et al., 2022a) to fuse H_language and H_vision. The fused output H_fuse ∈ R^{n×d} is obtained by:…”
Section: Model Architecture
confidence: 99%
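The equation in this statement is truncated in the excerpt. A common form of the gated fusion used in this line of work (a sketch of the standard formulation, not necessarily the exact equation of the cited paper; W is an assumed learned projection over the concatenated features) is:

```latex
\lambda = \operatorname{sigmoid}\!\bigl(W\,[H_{\text{language}};\, H_{\text{vision}}]\bigr), \qquad
H_{\text{fuse}} = (1 - \lambda)\, H_{\text{language}} + \lambda\, H_{\text{vision}}
```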