2013
DOI: 10.2478/pralin-2013-0013

XenC: An Open-Source Tool for Data Selection in Natural Language Processing

Abstract: In this paper we describe XenC, an open-source tool for data selection aimed at Natural Language Processing (NLP) in general and Statistical Machine Translation (SMT) or Automatic Speech Recognition (ASR) in particular. Usually, when building an SMT or ASR system, the considered task is related to a specific domain of application, such as news articles or scientific talks. The goal of XenC is to allow selection of relevant data regarding the considered task, which will be used to build the statistical…

Cited by 45 publications (41 citation statements)
References 4 publications
“…The Czech monolingual corpus News-2016 was backtranslated to English using the single best system provided by the University of Edinburgh from WMT'16. We then added five copies of Newscommentary and the news subcorpus from Czeng, as well as 5M sentences from the Czeng EU corpus randomly selected after running modified Moore-Lewis filtering with XenC (Rousseau, 2013). This resulted in about 14M parallel sentences.…”
Section: Data and Preprocessing (mentioning)
confidence: 99%
“…In order to select the most appropriate amount of monolingual data, we employ data selection techniques based on cross-entropy criterion using Xenc (Rousseau, 2013). The selected data is determined in such a way that the corresponding LM minimize the perplexity calculated on the development set.…”
Section: Smt System For Error Correction (mentioning)
confidence: 99%
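
The cross-entropy criterion mentioned in the statement above can be illustrated with a minimal sketch of Moore-Lewis style scoring: each candidate sentence is scored by its cross-entropy under an in-domain language model minus its cross-entropy under an out-of-domain model, and lower scores are treated as more relevant. This is not XenC's implementation. The add-one smoothed unigram models, the function names, and the toy corpora are all assumptions made for illustration; a real setup would use proper n-gram language models.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace tokens; returns (counts, total_tokens). Toy stand-in for a real LM."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy (bits) of a non-empty sentence under an add-one smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    log_prob = sum(math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens)
    return -log_prob / len(tokens)


def moore_lewis_rank(candidates, in_domain, out_domain):
    """Rank candidates by H_in(s) - H_out(s), most in-domain-like first."""
    candidates = [s for s in candidates if s.split()]  # skip empty lines
    in_lm = train_unigram(in_domain)
    out_lm = train_unigram(out_domain)
    # Shared vocabulary size so both models smooth over the same event space.
    vocab = set(in_lm[0]) | set(out_lm[0]) | {t for s in candidates for t in s.split()}
    v = len(vocab)
    scored = [
        (cross_entropy(s, in_lm, v) - cross_entropy(s, out_lm, v), s)
        for s in candidates
    ]
    return sorted(scored)


# Hypothetical toy data, for illustration only.
in_domain = ["the court ruled on the appeal", "the judge heard the appeal"]
out_domain = ["the cat sat on the mat", "a dog chased the cat"]
candidates = ["the judge ruled", "the cat chased a dog"]

ranked = moore_lewis_rank(candidates, in_domain, out_domain)
for score, sentence in ranked:
    print(f"{score:+.3f}  {sentence}")
```

In practice one would keep the top-ranked portion of the list and, as the statement describes, choose the cut-off so that a language model trained on the selected data minimizes perplexity on a development set. The bilingual "modified Moore-Lewis" variant sums this difference over the source and target sides of each sentence pair.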
“…We added five copies of News-commentary and the news subcorpus from Czeng, as well as 5M sentences from the Czeng EU corpus randomly selected after running modified Moore-Lewis filtering with XenC (Rousseau, 2013). The English-to-Latvian systems used all the parallel data provided at WMT'17.…”
Section: Data and Preprocessing (mentioning)
confidence: 99%