Instance-based Model Adaptation for Direct Speech Translation

Gangi, Mattia Antonino Di; Nguyen, Viet-Nhat; Negri, Matteo; Turchi, Marco

doi:10.1109/icassp40776.2020.9053901

Cited by 8 publications

(5 citation statements)

References 23 publications


“…The results are presented in Table 4. The performance of our character-level model is slightly worse than, but comparable with, the results reported in (Di Gangi et al, 2020) and in (Nguyen et al, 2019). Unidirectional refers to one-to-one systems. The other results are computed with one multilingual system for En→De,Nl and one for En→Es,Fr,It,Pt.…”

Section: Work (supporting)

confidence: 86%

“…The results on all the languages of MuST-C are presented in Table 3. Our character-level results are similar but not identical to the ones presented in (Di Gangi et al, 2020). Our BPE-level results outperform the ones at character level by at least 1.2 BLEU points on En-Ru and up to 3.3 points on En-Fr, with improvements of about 2 points in most of the languages.…”

Section: Work (supporting)

confidence: 66%


On Target Segmentation for Direct Speech Translation

Gangi, Gaido, Negri et al. 2020

Preprint

Self Cite


Recent studies on direct speech translation show continuous improvements by means of data augmentation techniques and bigger deep learning models. While these methods are helping to close the gap between this new approach and the more traditional cascaded one, there are many incongruities among different studies that make it difficult to assess the state of the art. Surprisingly, one point of discussion is the segmentation of the target text. Character-level segmentation was initially proposed to obtain an open vocabulary, but it results in long sequences and long training times. Then, subword-level segmentation became the state of the art in neural machine translation, as it produces shorter sequences that reduce the training time while being superior to word-level models. As such, recent works on speech translation started using target subwords despite the initial use of characters and some recent claims of better results at the character level. In this work, we perform an extensive comparison of the two methods on three benchmarks covering 8 language directions and multilingual training. Subword-level segmentation compares favorably in all settings, outperforming its character-level counterpart by 1 to 3 BLEU points. * The first author performed this work while he was a Ph.D. student at FBK.
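The sequence-length trade-off described in the abstract above can be illustrated with a small sketch. It is not the authors' pipeline: the `_` word-boundary marker, the toy vocabulary, and the greedy longest-match rule are all assumptions made for illustration, whereas real BPE learns its merges from training data.

```python
def char_segment(text):
    """Character-level segmentation: one token per character ('_' marks a space)."""
    return ["_" if c == " " else c for c in text]


def subword_segment(text, vocab):
    """Toy subword segmentation: greedy longest match against a fixed vocabulary.

    Falls back to single characters, so the vocabulary stays open.
    This is a stand-in for learned BPE merges, not BPE itself.
    """
    tokens = []
    for word in text.split():
        piece = "_" + word  # mark the word start
        i = 0
        while i < len(piece):
            # Try the longest candidate first; a single character always matches.
            for j in range(len(piece), i, -1):
                if piece[i:j] in vocab or j == i + 1:
                    tokens.append(piece[i:j])
                    i = j
                    break
    return tokens


# Hypothetical vocabulary, chosen by hand for this example.
toy_vocab = {"_the", "_trans", "lation"}
sentence = "the translation"

chars = char_segment(sentence)
subwords = subword_segment(sentence, toy_vocab)
print(len(chars), len(subwords))
```

With this hand-picked vocabulary the subword sequence is several times shorter than the character sequence, which is the effect on training time that the abstract refers to.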


“…On one hand, audio source avoids the error propagation and exposure bias introduced by using as context the translations generated at inference time. On the other, these problems are balanced by the easiness of extracting information from text rather than from audio [12]. In this work, we study both options.…”

Section: Context-aware ST (mentioning)

confidence: 99%

“…When we use the generated translations as context, its tokens are converted into vectors with word embeddings (namely, we re-use the decoder embeddings), summed with positional encoding and then provided to the encoder Transformer layers. When we use the audio as context, the input audio features are first processed by the encoder of the base model and then passed to the context encoder [12]. Sequential (Figure 1).…”

Section: Context-aware ST (mentioning)

confidence: 99%

Contextualized Translation of Automatically Segmented Speech

Gaido, Gangi, Negri et al. 2020

Preprint

Self Cite


Direct speech-to-text translation (ST) models are usually trained on corpora segmented at sentence level, but at inference time they are commonly fed with audio split by a voice activity detector (VAD). Since VAD segmentation is not syntax-informed, the resulting segments do not necessarily correspond to well-formed sentences uttered by the speaker but, most likely, to fragments of one or more sentences. This segmentation mismatch considerably degrades the quality of ST models' output. So far, researchers have focused on improving audio segmentation towards producing sentence-like splits. In this paper, instead, we address the issue in the model, making it more robust to a different, potentially sub-optimal segmentation. To this aim, we train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context. We show that our context-aware solution is more robust to VAD-segmented input, outperforming a strong base model and the fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.


“…3 The instance-based method is used to slow the error by weighting the source samples and train the weighted source samples. 4 The feature-based methods usually transform the features of the source and the target domains into a shared space where the feature distributions of the two data sets match. The domain adaptive method, based on feature representation, is the most commonly used method.…”

Section: Introduction (mentioning)

confidence: 99%

Deep adversarial domain adaptation network

Chen et al. 2020

International Journal of Advanced Robotic Systems


The advantage of adversarial domain adaptation is that it uses the idea of adversarial adaptation to confuse the feature distribution of two domains and solve the problem of domain transfer in transfer learning. However, although the discriminator completely confuses the two domains, adversarial domain adaptation still cannot guarantee the consistent feature distribution of the two domains, which may further deteriorate the recognition accuracy. Therefore, in this article, we propose a deep adversarial domain adaptation network, which optimises the feature distribution of the two confused domains by adding multi-kernel maximum mean discrepancy to the feature layer and designing a new loss function to ensure good recognition accuracy. In the last part, some simulation results based on the Office-31 and Underwater data sets show that the deep adversarial domain adaptation network can optimise the feature distribution and promote positive transfer, thus improving the classification accuracy.
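The multi-kernel maximum mean discrepancy (MK-MMD) mentioned in the abstract above measures how far apart two feature distributions are. The following is a minimal sketch of a biased estimate of the squared MK-MMD, not the network or loss from the cited article: the Gaussian kernels, the bandwidth values, and the equal kernel weights are assumptions, and `mk_mmd_squared` is a hypothetical helper name.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths):
    """Equal-weight average of Gaussian kernels (the weighting is an assumption)."""
    return sum(gaussian_kernel(x, y, b) for b in bandwidths) / len(bandwidths)


def mk_mmd_squared(source, target, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MMD between two sets of feature vectors.

    MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)], with k the multi-kernel.
    """
    def mean_kernel(xs, ys):
        total = sum(multi_kernel(x, y, bandwidths) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Made-up feature vectors standing in for source-domain and target-domain features.
source_features = [[0.0, 0.0], [1.0, 0.0]]
target_features = [[3.0, 3.0], [4.0, 3.0]]
print(mk_mmd_squared(source_features, source_features))
print(mk_mmd_squared(source_features, target_features))
```

The estimate is zero when the two sets coincide and grows as they move apart, so adding it as a loss term on a feature layer pushes the source and target feature distributions toward each other.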

