2017 | Preprint
DOI: 10.48550/arxiv.1710.03501
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

P. Godard, G. Adda, M. Adda-Decker, et al.

Abstract: Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world languages do not have such resources and some even lack a stable orthography. Building systems under these almost zero resource conditions is not only promising for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists to (semi-)automatically analyze and annotate audio recordings of endangered…

Cited by 10 publications (14 citation statements)
References 14 publications
“…The same datasets of [9] will be used: a corpus in Kunwinjku which consists of 301 utterances aligned with an orthographic transcription and a forced alignment created using the MAUS forced aligner [20], and a corpus in Mboshi which consists of 5130 utterances elicited from text, with orthographic transcription and a forced alignment at the word level [21].…”
Section: Data
confidence: 99%
“…In this section, we evaluate generalization of multilingual ST models by performing transfer learning to a very low-resource ST task. We used Mboshi-French corpus [22], which contains 4.4-hours of spoken utterances and the corresponding Mboshi transcriptions and French translations. Mboshi [49] is a Bantu C25 language spoken in Congo-Brazzaville and does not have standard orthography.…”
Section: Pre-training with the ASR Encoder
confidence: 99%
“…We evaluate one-to-many (O2M) and many-to-many (M2M) translations by combining these corpora and confirm significant improvements by multilingual training in both scenarios. Next, we evaluate the generalization of multilingual E2E-ST models by performing transfer learning to a very low-resource ST task: Mboshi (Bantu C25)→Fr corpus (4.4 hours) [22]. We show that multilingual pretraining of the seed E2E-ST models improves the performance in the low-resource language pair unseen during training, compared to bilingual pre-training.…”
Section: Introduction
confidence: 98%
“…Mboshi-French: Mboshi (Bantu C25 in the Guthrie classification) is a language spoken in Congo-Brazzaville, without standard orthography. We use a corpus of 5517 parallel utterances (about 4.4 hours of audio) collected from three native speakers using the LIG-Aikuma app for the BULB project [5,7]. The corpus provides non-standard grapheme transcriptions (produced by linguists to be close to the language phonology) as well as French translations.…”
Section: Data
confidence: 99%
“…One is Ainu, a severely endangered language, with translations in English. We also experiment on a recently collected speech corpus of Mboshi [7], with translations in French.…”
Section: Introduction
confidence: 99%