2010
DOI: 10.2478/v10108-010-0015-5

Free/Open-Source Resources in the Apertium Platform for Machine Translation Research and Development

Abstract: This paper describes the resources available in the Apertium platform, a free/open-source framework for creating rule-based machine translation systems. Resources within the platform take the form of finite-state morphologies for morphological analysis and generation, bilingual transfer lexica, probabilistic part-of-speech taggers and transfer rule files, all in standardised formats. These resources are describe…

Cited by 15 publications (8 citation statements)
References: 11 publications
“…Several databases were used to find the word pairs. The 'apertium' translation database (Tyers, Sánchez-Martínez, Ortiz-Rojas, & Forcada, 2010) was used to define an initial set of Basque-Spanish noun translations. Next, using the citation-form phonological transcription in lexical databases for Spanish ('B-Pal', Davis & Perea, 2005) and Basque ('E-Hitz', Perea et al, 2006), pairs of nouns were chosen such that: the members of each pair had the same number of syllables but distinct initial phonemes, the Levenshtein distance between the CV transcriptions of each pair was less than three, the Levenshtein distance between the phonological transcriptions was greater than three (to avoid cognates), the absolute difference in log10 frequency was less than 2, both the Spanish and Basque frequencies-per-million were greater than 5, both the Spanish and the Basque words had a noun part-of-speech tag, and that the absolute difference in the number of phonemes was not greater than 1.…”
Section: Materials and Design (mentioning)
confidence: 99%
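
The selection criteria quoted above can be read as a filter over candidate translation pairs. The following is a minimal sketch of that reading, not the citing authors' code. The `Word` record, its field names, the one-character-per-phoneme encoding, and the use of frequency-per-million as the input to the log10 comparison are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@dataclass
class Word:
    # Hypothetical record; field names are not taken from B-Pal or E-Hitz.
    phonemes: str            # assumed non-empty, one character per phoneme
    cv: str                  # consonant/vowel skeleton, e.g. "CVCV"
    syllables: int
    freq_per_million: float
    pos: str


def is_candidate_pair(es: Word, eu: Word) -> bool:
    """Apply the quoted criteria to one Spanish/Basque translation pair."""
    # Check the frequency floor first so log10 never sees a zero.
    if es.freq_per_million <= 5 or eu.freq_per_million <= 5:
        return False
    log_diff = abs(math.log10(es.freq_per_million) - math.log10(eu.freq_per_million))
    return (
        es.pos == "noun" and eu.pos == "noun"
        and es.syllables == eu.syllables
        and es.phonemes[0] != eu.phonemes[0]
        and levenshtein(es.cv, eu.cv) < 3
        and levenshtein(es.phonemes, eu.phonemes) > 3   # avoid cognates
        and log_diff < 2
        and abs(len(es.phonemes) - len(eu.phonemes)) <= 1
    )
```

In this sketch the frequency floor is tested before taking logarithms, so a word with zero recorded frequency is rejected rather than raising an error.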
“…For example the UniMorph project (Kirov et al, 2016) extracts and normalizes morphological paradigms from the Wiktionary free online dictionary site. Further, finite state transducers for morphological analysis and generation for a multitude of languages are available in the Apertium framework (Tyers et al, 2010). Both of these resources can be used to collect inflected words and for each word a set of possible lemmas together with the corresponding morphological features.…”
Section: Morphological Transducers (mentioning)
confidence: 99%
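
As an illustration of collecting, for each inflected word, its possible lemmas and morphological features, here is a minimal sketch that parses analyser output in the Apertium stream format (`^surface/lemma<tag>.../lemma<tag>...$`). It is a simplification, not the citing authors' pipeline: it assumes no backslash-escaped characters in the stream and does not handle multiword analyses. The function name `parse_analyses` and the sample input are hypothetical.

```python
import re
from collections import defaultdict

# One lexical unit in the stream: ^surface/analysis1/analysis2$
UNIT_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")
TAG_RE = re.compile(r"<([^>]+)>")


def parse_analyses(stream):
    """Map each surface form to a list of (lemma, [tags]) analyses.

    Simplified: escaped characters and multiword analyses are not handled.
    """
    result = defaultdict(list)
    for surface, rest in UNIT_RE.findall(stream):
        for analysis in rest.split("/"):
            if analysis.startswith("*"):
                # A leading asterisk marks a word unknown to the analyser.
                continue
            lemma = analysis.split("<", 1)[0]
            tags = TAG_RE.findall(analysis)
            entry = (lemma, tags)
            if entry not in result[surface]:
                result[surface].append(entry)
    return dict(result)


# Illustrative input only; real output depends on the language data used.
sample = "^houses/house<n><pl>/house<vblex><pri><p3><sg>$ ^xyzzy/*xyzzy$"
analyses = parse_analyses(sample)
```

Unknown words are skipped in this sketch, so they do not appear as keys in the returned dictionary.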
“…For the latter five treebanks with tiny training sample, we trained the tagger and parser in the standard manner, despite the tiny training set size. However, for four of these five languages (Armenian, Buryat, Kazakh and Kurmanji) we used Apertium morphological transducers (Tyers et al, 2010) to artificially extend the lemmatizer training data by including new words from the transducer not present in the original training data (methods are similar to those used with Breton and Faroese, for details see Section 4.1). Naija is parsed using the English-EWT models without any extra processing as it strongly resembles English language and at the same time lacks all resources.…”
Section: Near-zero Resource Languages (mentioning)
confidence: 99%