Neural Machine Translation with Phrase-Level Universal Visual Representations

Qingkai, Fang,; Feng, Yang

doi:10.18653/v1/2022.acl-long.390

Cited by 18 publications

(5 citation statements)

References 25 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The gated fusion mechanism is a popular technique for fusing representations from different sources (Wu et al, 2021;Fang and Feng, 2022;Lin et al, 2020;. The fused output is a weighted sum between the text representation and the selective attention output, in which the weight is controlled with the gate λ.…”

Section: Gated Fusionmentioning

confidence: 99%

Video-Helpful Multimodal Machine Translation

Li,

Shimizu,

Chu

et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

View full text Add to dashboard Cite

Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles, which rarely contain linguistic ambiguity, making visual information ineffective in generating appropriate translations. Recent work has constructed an ambiguous subtitles dataset to alleviate this problem but is still limited to the problem that videos do not necessarily contribute to disambiguation. We introduce EVA (Extensive training set and Videohelpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English (Ja-En) parallel subtitle pairs, 520k Chinese-English (Zh-En) parallel subtitle pairs, and corresponding video clips collected from movies and TV episodes. In addition to the extensive training set, EVA contains a video-helpful evaluation set in which subtitles are ambiguous, and videos are guaranteed helpful for disambiguation. Furthermore, we propose SAFA, an MMT model based on the Selective Attention model with two novel methods: Frame attention loss and Ambiguity augmentation, aiming to use videos in EVA for disambiguation fully. Experiments on EVA show that visual information and the proposed methods can boost translation performance, and our model performs significantly better than existing MMT models. The EVA dataset and the SAFA model are available at: https://github.com/ku-nlp/videohelpful-MMT.git.

show abstract

Section: Gated Fusionmentioning

confidence: 99%

Video-Helpful Multimodal Machine Translation

Li,

Shimizu,

Chu

et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

View full text Add to dashboard Cite

show abstract

“…Since it is difficult to train an end-to-end ST model directly, some training techniques like pretraining (Weiss et al, 2017;Berard et al, 2018;Bansal et al, 2019;Stoian et al, 2020;Wang et al, 2020b;Dong et al, 2021a;Alinejad and Sarkar, 2020;Zheng et al, 2021b;, multi-task learning (Le et al, 2020;Vydana et al, 2021;Tang et al, 2021b;Ye et al, 2021;Tang et al, 2021a), curriculum learning (Kano et al, 2017;Wang et al, 2020c), and meta-learning (Indurthi et al, 2020) have been applied. Recent work has introduced mixup on machine translation (Zhang et al, 2019b;Guo et al, 2022;Fang and Feng, 2022), sentence classification (Chen et al, 2020;Jindal et al, 2020;Sun et al, 2020), multilingual understanding , and speech recognition (Medennikov et al, 2018;Sun et al, 2021;Lam et al, 2021a;Meng et al, 2021), and obtained enhancements.…”

Section: Can the Final Model Still Perform Mt Task?mentioning

confidence: 99%

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

Qingkai¹,

Ye²,

Li³

et al. 2022

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

View full text Add to dashboard Cite

How to learn a better speech representation for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST, but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate such discrepancy. Specifically, we mix up the representation sequences of different modalities, and take both unimodal speech sequences and multimodal mixed sequences as input to the translation model in parallel, and regularize their output predictions with a selflearning framework. Experiments on MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy, and achieves significant improvements over a strong baseline on eight translation directions.

show abstract

“…However, these approaches often rely on time-consuming parsing tools to extract phrases. For the learning of phrase representations, averaging token representations is commonly used (Fang and Feng, 2022;Ma et al, 2022). While this method is simple, it fails to effectively capture the overall semantics of the phrases, thereby impacting the model performance.…”

Section: Introductionmentioning

confidence: 99%

Enhancing Neural Machine Translation with Semantic Units

Huang,

Gu,

Zhuocheng

et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023

View full text Add to dashboard Cite

Conventional neural machine translation (NMT) models typically use subwords and words as the basic units for model input and comprehension. However, complete words and phrases composed of several tokens are often the fundamental units for expressing semantics, referred to as semantic units. To address this issue, we propose a method Semantic Units for Machine Translation (SU4MT) which models the integral meanings of semantic units within a sentence, and then leverages them to provide a new perspective for understanding the sentence. Specifically, we first propose Word Pair Encoding (WPE), a phrase extraction method to help identify the boundaries of semantic units. Next, we design an Attentive Semantic Fusion (ASF) layer to integrate the semantics of multiple subwords into a single vector: the semantic unit representation. Lastly, the semantic-unit-level sentence representation is concatenated to the token-level one, and they are combined as the input of encoder. Experimental results demonstrate that our method effectively models and leverages semantic-unit-level information and outperforms the strong baselines. The code is available at https://github.com/ictnlp/SU4MT.

show abstract

Neural Machine Translation with Phrase-Level Universal Visual Representations

Cited by 18 publications

References 25 publications

Video-Helpful Multimodal Machine Translation

Video-Helpful Multimodal Machine Translation

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

Enhancing Neural Machine Translation with Semantic Units

Contact Info

Product

Resources

About