2018
DOI: 10.1080/0907676x.2018.1485716

Optimising the Europarl corpus for translation studies with the EuroparlExtract toolkit

Abstract: The freely available European Parliament Proceedings Parallel Corpus, or Europarl, is one of the largest multilingual corpora available to date. Surprisingly, bibliometric analyses show that it has hardly been used in translation studies. Its low impact in translation studies may partly be attributed to the fact that the Europarl corpus is distributed in a format that largely disregards the needs of translation research. In order to make the wealth of linguistic data from Europarl easily and readily available …

Cited by 11 publications (7 citation statements)
References 18 publications
“…EuroParl is a parallel corpus with translations provided by professional human translators and is extracted from the European Parliament website by Koehn (2005). This corpus was chosen in light of its “(…) free availability, size, linguistic diversity, data authenticity, and sentence-aligned architecture as well as homogeneity in terms of register, text type, and subject domain (…)” (Ustaszewski, 2019: 107), all of which make it ideal for translation-oriented corpus-based inquiries, moreover, if applied as a data-driven approach that serves to characterize mental or sociocultural aspects of interlingual phenomena.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…The corpus for each language is a Wikipedia dump from 27 July 2020, cleaned using tools from Bojanowski et al (2017), and tokenized using EuroparlExtract (Ustaszewski, 2019), except for Bengali and Hindi, which are tokenized using NLTK (Bird et al, 2009). Because DUONG2016 and HAKIMI2020 can learn high quality cross-lingual embeddings from monolingual corpora of only 5M sentences each, we down-sample the English corpus for these two methods to 5M sentences.…”
Section: Training Corpora and Dictionaries
Citation type: mentioning
confidence: 99%