Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations 2020
DOI: 10.18653/v1/2020.acl-demos.20

OpusFilter: A Configurable Parallel Corpus Filtering Toolbox

Abstract: This paper introduces OpusFilter, a flexible and modular toolbox for filtering parallel corpora. It implements a number of components based on heuristic filters, language identification libraries, character-based language models, and word alignment tools, and it can easily be extended with custom filters. Bitext segments can be ranked according to their quality or domain match using single features or a logistic regression model that can be trained without manually labeled training data. We demonstrate the eff…

Cited by 19 publications (24 citation statements)
References: 11 publications

“…OpenSubtitles2018, which consists of subtitle translations, and corpora gathered by crawling the internet, Common Crawl and ParaCrawl, are especially likely to contain noisy data. For filtering the corpora, we utilize OpusFilter (Aulamo et al, 2020), a toolbox for creating clean parallel corpora.…”
Section: Data Preprocessing (mentioning)
confidence: 99%
“…First, we extract six feature values for each of the sentence pairs. In particular, we apply the following features: CharacterScore, CrossEntropy, LanguageID, NonZeroNumeral, TerminalPunctuation and WordAlign, each of which is defined in Aulamo et al (2020). Secondly, we train a logistic regression classifier based on those features.…”
Section: Data Preprocessing (mentioning)
confidence: 99%
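The two steps described in the statement above (extract feature values per sentence pair, then train a logistic regression classifier on them) can be sketched roughly as follows. This is a minimal illustration only: the three features are simplified, hypothetical stand-ins and not the OpusFilter implementations of CharacterScore, CrossEntropy, LanguageID, NonZeroNumeral, TerminalPunctuation or WordAlign; the toy sentence pairs and their labels are invented for the example; and the classifier is a small hand-written gradient-descent loop rather than the model used by the cited authors.

```python
import math
import re


def end_mark(sentence):
    """Return the final character if it is sentence-final punctuation, else ''."""
    last = sentence[-1:]
    return last if last and last in ".!?" else ""


def features(src, tgt):
    """Three toy feature values in [0, 1] for one sentence pair (hypothetical)."""
    # Character length ratio: shorter side divided by longer side.
    ratio = min(len(src), len(tgt)) / max(len(src), len(tgt), 1)
    # 1.0 if both sides contain the same digit sequences.
    same_digits = float(sorted(re.findall(r"\d+", src)) == sorted(re.findall(r"\d+", tgt)))
    # 1.0 if both sides end with the same punctuation mark (or both with none).
    same_end = float(end_mark(src) == end_mark(tgt))
    return [ratio, same_digits, same_end]


def predict(weights, x):
    """Probability that the pair is clean; the last weight is the bias."""
    z = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, epochs=500, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            for i, xi in enumerate(x):
                weights[i] -= lr * err * xi
            weights[-1] -= lr * err
    return weights


# Invented toy data: (source, target, label), where 1 = clean and 0 = noisy.
toy = [
    ("The price is 25 euros.", "Hinta on 25 euroa.", 1),
    ("Where are you going?", "Minne olet menossa?", 1),
    ("Click here", "Tämä on täysin eri lause, jolla ei ole mitään tekemistä lähteen kanssa.", 0),
    ("Call 555 now!", "Soita", 0),
]
weights = train([features(s, t) for s, t, _ in toy], [y for _, _, y in toy])

# Score unseen pairs; a higher score means the pair looks cleaner to the toy model.
clean_score = predict(weights, features("I have 2 cats.", "Minulla on 2 kissaa."))
noisy_score = predict(weights, features("Home", "Copyright 2020 all rights reserved."))
```

Sorting sentence pairs by such a score and keeping the top portion is one way the ranking mentioned in the abstract could be turned into a filter; the cut-off would have to be chosen per corpus.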
“…Needless to say, these language pairs pose big challenges since none of them benefits from large quantities of parallel data and there is limited monolingual data. For our participation, we focused our efforts mainly on three aspects: (1) gathering additional parallel and monolingual data for each language, taking advantage in particular of the OPUS corpus collection (Tiedemann, 2012), the JHU Bible corpus (McCarthy et al, 2020) and translations of political constitutions of various Latin American countries, (2) cleaning and filtering the corpora to maximize their quality with the OpusFilter toolbox (Aulamo et al, 2020), and (3) contrasting different training techniques that could take advantage of the scarce data available.…”
Section: Introduction (mentioning)
confidence: 99%
“…2 Data preparation. A main part of our effort was directed to finding relevant corpora that could help with the translation tasks, as well as to make the best out of the data provided by the organizers. In order to have an efficient procedure to maintain and process the data sets for all the ten languages, we utilized the OpusFilter toolbox (Aulamo et al, 2020). It provides both ready-made and extensible methods for combining, cleaning, and filtering parallel and monolingual corpora.…”
Section: Introduction (mentioning)
confidence: 99%