OpenKiwi: An Open Source Framework for Quality Estimation

Kepler, Fabio Natanael; Trénous, Jonay; Treviso, Marcos; Vera, Miguel; Martins, André F. T.

doi:10.48550/arxiv.1902.08646

Cited by 4 publications

(7 citation statements)

References 5 publications

“…Next, we experimented on the effect of predictors pretrained with different language pairs by using the trained predictor weights provided along with the WMT20 shared task and OpenKiwi (Kepler et al, 2019b). We utilized the weight except for the embedding layers.…”

Section: Results Of Fine-tuning Pretrained Predictor (mentioning)

confidence: 99%

“…We adapts "OpenKiwi" (Kepler et al, 2019b), an open-source framework for QE task, to construct our proposed ensemble-based QE model in Figure 1. Similar as other state-of-the-art methods (Kim et al, 2017;Wang et al, 2018), we use a neural-based architecture, which is mainly based on the predictor-estimator architecture initially proposed from (Kim et al, 2017).…”

Section: Approach (mentioning)

confidence: 99%

“…It is equivalent to maximize the likelihood of target sequences given the predicted multinomial distribution P . It is also used in Predictor-Estimator structure in our baseline model from Openkiwi (Kepler et al, 2019b).…”

Section: Loss (mentioning)

confidence: 99%

Ensemble-based Transfer Learning for Low-resource Machine Translation Quality Estimation

Wu, Hsieh, Liu

2021

Preprint

Quality Estimation (QE) of Machine Translation (MT) is a task to estimate the quality scores for given translation outputs from an unknown MT system. However, QE scores for low-resource languages are usually intractable and hard to collect. In this paper, we focus on the Sentence-Level QE Shared Task of the Fifth Conference on Machine Translation (WMT20), but in a more challenging setting. We aim to predict QE scores of given translation outputs when barely none of QE scores of that paired languages are given during training. We propose an ensemble-based predictor-estimator QE model with transfer learning to overcome such QE data scarcity challenge by leveraging QE scores from other miscellaneous languages and translation results of targeted languages. Based on the evaluation results, we provide a detailed analysis of how each of our extension affects QE models on the reliability and the generalization ability to perform transfer learning under multilingual tasks. Finally, we achieve the best performance on the ensemble model combining the models pretrained by individual languages as well as different levels of parallel trained corpus with a Pearson's correlation of 0.298, which is 2.54 times higher than baselines.

“…A number of embedding-based metrics has proven to achieve the highest performance in recent WMT shared tasks for quality metrics (e.g. [7,8,12]). We take BERTScore as representative of this category.…”

Section: Related Work (mentioning)

confidence: 99%

“…Good evaluation metrics should have a high correlation with human judgement on the quality of translation. Recently some automatic metrics have achieved a significant correlation with human judgement on the WMT Metrics task datasets (see [7,8,12]). However, research has reported weaker correlation with low human assessment score ranges for segment-level evaluation [20,19].…”

Section: Introduction (mentioning)

confidence: 99%

BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text

Saadany, Orasan

2021

Preprint

Social media companies as well as authorities make extensive use of artificial intelligence (AI) tools to monitor postings of hate speech, celebrations of violence or profanity. Since AI software requires massive volumes of data to train computers, Machine Translation (MT) of the online content is commonly used to process posts written in several languages and hence augment the data needed for training. However, MT mistakes are a regular occurrence when translating sentiment-oriented user-generated content (UGC), especially when a low-resource language is involved. The adequacy of the whole process relies on the assumption that the evaluation metrics used give a reliable indication of the quality of the translation. In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors which can cause serious misunderstanding of the affect message. We compare the performance of three canonical metrics on meaningless translations where the semantic content is seriously impaired as compared to meaningful translations with a critical error which exclusively distorts the sentiment of the source text. We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment critical errors.

The TALP-UPC System for the WMT Similar Language Task: Statistical vs Neural Machine Translation

Biesialska, Guardia, Costa-jussà

2019

Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Although the problem of similar language translation has been an area of research interest for many years, yet it is still far from being solved. In this paper, we study the performance of two popular approaches: statistical and neural. We conclude that both methods yield similar results; however, the performance varies depending on the language pair. While the statistical approach outperforms the neural one by a difference of 6 BLEU points for the Spanish-Portuguese language pair, the proposed neural model surpasses the statistical one by a difference of 2 BLEU points for Czech-Polish. In the former case, the language similarity (based on perplexity) is much higher than in the latter case. Additionally, we report negative results for the system combination with back-translation. Our TALP-UPC system submission won 1st place for Czech→Polish and 2nd place for Spanish→Portuguese in the official evaluation of the 1st WMT Similar Language Translation task.
