OpusFilter: A Configurable Parallel Corpus Filtering Toolbox

Aulamo, Mikko; Virpioja, Sámi; Tiedemann, Jörg

doi:10.18653/v1/2020.acl-demos.20

Cited by 19 publications

(24 citation statements)

References 11 publications


“…OpenSubtitles2018, which consists of subtitle translations, and corpora gathered by crawling the internet, Common Crawl and ParaCrawl, are especially likely to contain noisy data. For filtering the corpora, we utilize OpusFilter (Aulamo et al, 2020), a toolbox for creating clean parallel corpora.…”

Section: Data Preprocessing (mentioning)

confidence: 99%


The University of Helsinki Submission to the IWSLT2020 Offline SpeechTranslation Task

Vázquez, Aulamo, Sulubacak et al. 2020

Proceedings of the 17th International Conference on Spoken Language Translation

Self Cite

This paper describes the University of Helsinki Language Technology group's participation in the IWSLT 2020 offline speech translation task, addressing the translation of English audio into German text. In line with this year's task objective, we train both cascade and end-to-end systems for spoken language translation. We opt for an end-to-end multitasking architecture with shared internal representations and a cascade approach that follows a standard procedure consisting of ASR, correction, and MT stages. We also describe the experiments that served as a basis for the submitted systems. Our experiments reveal that multitasking training with shared internal representations is not only possible but allows for knowledge-transfer across modalities.



“…First, we extract six feature values for each of the sentence pairs. In particular, we apply the following features: CharacterScore, CrossEntropy, LanguageID, NonZeroNumeral, TerminalPunctuation and WordAlign, each of which is defined in Aulamo et al (2020). Secondly, we train a logistic regression classifier based on those features.…”

Section: Data Preprocessing (mentioning)

confidence: 99%
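The two-step procedure in the statement above (score each sentence pair with several features, then train a logistic regression classifier on the scores) can be illustrated with a minimal, standard-library sketch. Everything here is an assumption made for illustration: the three feature functions are simplified stand-ins and not the definitions from Aulamo et al (2020), `length_ratio` is not one of the six features named in the quote, and the labelled pairs are toy data.

```python
import math
from difflib import SequenceMatcher

TERMINAL = (".", "?", "!", "…")

def numeral_similarity(src: str, tgt: str) -> float:
    """Similarity of the non-zero digit sequences of both segments (simplified stand-in)."""
    src_digits = [c for c in src if c in "123456789"]
    tgt_digits = [c for c in tgt if c in "123456789"]
    return SequenceMatcher(None, src_digits, tgt_digits).ratio()

def punctuation_agreement(src: str, tgt: str) -> float:
    """1.0 if both segments agree on ending with terminal punctuation, else 0.0."""
    return 1.0 if src.rstrip().endswith(TERMINAL) == tgt.rstrip().endswith(TERMINAL) else 0.0

def length_ratio(src: str, tgt: str) -> float:
    """Shorter word count divided by longer word count (hypothetical extra feature)."""
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b) if max(a, b) else 0.0

def features(src: str, tgt: str) -> list:
    return [numeral_similarity(src, tgt), punctuation_agreement(src, tgt), length_ratio(src, tgt)]

def predict_proba(weights: list, x: list) -> float:
    """Probability that a pair is clean; weights[0] is the bias term."""
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X: list, y: list, lr: float = 0.5, epochs: int = 500) -> list:
    """Plain stochastic gradient descent on the logistic loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = predict_proba(weights, x) - label
            weights[0] -= lr * err
            for i, xi in enumerate(x):
                weights[i + 1] -= lr * err * xi
    return weights

# Toy labelled pairs: 1 = clean, 0 = noisy.
labelled = [
    ("The price is 25 euros.", "Der Preis beträgt 25 Euro.", 1),
    ("Where are you?", "Wo bist du?", 1),
    ("I have 3 cats.", "Ich habe 3 Katzen.", 1),
    ("Click here", "Der Vertrag wurde 1999 in Berlin unterzeichnet.", 0),
    ("Page 12", "Vielen Dank für Ihre Aufmerksamkeit!", 0),
    ("He left in 2010.", "Home | Kontakt | Impressum 4 7", 0),
]
X = [features(src, tgt) for src, tgt, _ in labelled]
y = [label for _, _, label in labelled]
weights = train_logreg(X, y)

# Score unseen pairs; a threshold of 0.5 separates "keep" from "discard".
clean_score = predict_proba(weights, features("We meet at 8.", "Wir kommen um 8."))
noisy_score = predict_proba(weights, features("Subscribe now", "Er wurde 1985 geboren."))
```

In practice the classifier would be trained on many more pairs and on the full feature set; the sketch only shows how per-pair feature scores feed a binary keep/discard decision.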

The University of Helsinki Submission to the IWSLT2020 Offline SpeechTranslation Task

Vázquez, Aulamo, Sulubacak et al. 2020

Proceedings of the 17th International Conference on Spoken Language Translation

Self Cite

“…Needless to say, these language pairs pose big challenges since none of them benefits from large quantities of parallel data and there is limited monolingual data. For our participation, we focused our efforts mainly on three aspects: (1) gathering additional parallel and monolingual data for each language, taking advantage in particular of the OPUS corpus collection (Tiedemann, 2012), the JHU Bible corpus (McCarthy et al, 2020) and translations of political constitutions of various Latin American countries, (2) cleaning and filtering the corpora to maximize their quality with the OpusFilter toolbox (Aulamo et al, 2020), and (3) contrasting different training techniques that could take advantage of the scarce data available.…”

Section: Introduction (mentioning)

confidence: 99%

The Helsinki submission to the AmericasNLP shared task

Vázquez, Scherrer, Virpioja et al. 2021

Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

Self Cite

The University of Helsinki participated in the AmericasNLP shared task for all ten language pairs. Our multilingual NMT models reached the first rank on all language pairs in track 1, and first rank on nine out of ten language pairs in track 2. We focused our efforts on three aspects: (1) the collection of additional data from various sources such as Bibles and political constitutions, (2) the cleaning and filtering of training data with the OpusFilter toolkit, and (3) different multilingual training techniques enabled by the latest version of the OpenNMT-py toolkit to make the most efficient use of the scarce data. This paper describes our efforts in detail.


“…Needless to say, these language pairs pose big challenges since none of them benefits from large quantities of parallel data and there is limited monolingual data. For our participation, we focused our efforts mainly on three aspects: (1) gathering additional parallel and monolingual data for each language, taking advantage in particular of the OPUS corpus collection, the JHU Bible corpus and translations of political constitutions of various Latin American countries, (2) cleaning and filtering the corpora to maximize their quality with the OpusFilter toolbox (Aulamo et al, 2020), and (3) contrasting different training techniques that could take advantage of the scarce data available.…”

Section: Introduction (mentioning)

confidence: 99%

“…Data preparation. A main part of our effort was directed to finding relevant corpora that could help with the translation tasks, as well as to make the best out of the data provided by the organizers. In order to have an efficient procedure to maintain and process the data sets for all the ten languages, we utilized the OpusFilter toolbox (Aulamo et al, 2020). It provides both ready-made and extensible methods for combining, cleaning, and filtering parallel and monolingual corpora.…”

Section: Introduction (mentioning)

confidence: 99%
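To make "combining, cleaning, and filtering" concrete, here is a rough standard-library sketch of such a pipeline over parallel sentence pairs. It does not use the OpusFilter API; the function names, thresholds, and example pairs are all hypothetical and chosen only to illustrate the idea of chaining independent accept/reject filters.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]
PairFilter = Callable[[str, str], bool]

def length_filter(min_words: int = 1, max_words: int = 100) -> PairFilter:
    """Accept pairs whose segments both fall within the word-count bounds."""
    def accept(src: str, tgt: str) -> bool:
        return all(min_words <= len(s.split()) <= max_words for s in (src, tgt))
    return accept

def length_ratio_filter(threshold: float = 3.0) -> PairFilter:
    """Accept pairs whose longer/shorter word-count ratio stays below the threshold."""
    def accept(src: str, tgt: str) -> bool:
        a, b = len(src.split()), len(tgt.split())
        return min(a, b) > 0 and max(a, b) / min(a, b) < threshold
    return accept

def deduplicate(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Drop exact duplicate pairs while preserving the original order."""
    seen = set()
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            yield pair

def filter_pairs(pairs: Iterable[Pair], filters: List[PairFilter]) -> Iterator[Pair]:
    """Keep only the pairs accepted by every filter."""
    for src, tgt in pairs:
        if all(f(src, tgt) for f in filters):
            yield src, tgt

# Two toy corpora standing in for differently sourced data sets.
corpus_a = [("Hello world .", "Hallo Welt ."), ("", "Leere Quelle")]
corpus_b = [
    ("Hello world .", "Hallo Welt ."),
    ("Yes", "Dies ist eine sehr lange und unpassende Übersetzung"),
    ("Good morning", "Guten Morgen"),
]

combined = chain(corpus_a, corpus_b)                    # combining
unique = deduplicate(combined)                          # cleaning
filters = [length_filter(), length_ratio_filter()]
clean = list(filter_pairs(unique, filters))             # filtering
```

Keeping each filter as a small function that returns a boolean makes it easy to add, remove, or reorder criteria per corpus, which is the kind of configurability the quoted passage describes.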

Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

Mager, Oncevay, Rios et al. 2021

This area is in all probability unmatched, anywhere in the world, in its linguistic multiplicity and diversity. A couple of thousand languages and dialects, at present divided into 17 large families and 38 small ones, with several hundred unclassified single languages, are on record. In one small portion of the area, in Mexico just north of the Isthmus of Tehuantepec, one finds a diversity of linguistic type hard to match on an entire continent in the Old World.
