2020
DOI: 10.1109/taslp.2020.2973896
Speech Technology for Unwritten Languages

Abstract: Speech technology plays an important role in our everyday life. Among other applications, speech is used for human-computer interaction, for instance in information retrieval and on-line shopping. For an unwritten language, however, speech technology is difficult to create, because it cannot be built from the standard combination of pre-trained speech-to-text and text-to-speech subsystems. The research presented in this paper takes the first steps towards speech technology for unwritten languages.…

Cited by 21 publications (20 citation statements)
References 38 publications
“…Typically, a well-trained deep neural network (DNN) based ASR system [2] requires hundreds to thousands of hours of transcribed speech data. This requirement can be challenging to most of the world's spoken languages that are considered low-resourced [3]. Most languages, including ethnic minority languages in most countries outside Europe, lack transcribed speech data, digitized texts, and digitized pronunciation lexicons.…”
Section: Introduction
confidence: 99%
“…Recent research on VGS models focused on architectural and training scheme improvements [5], [6], [11] and applications such as semantic keyword spotting [7], [12] and speech-based image retrieval [3], [5], [6], [8]. Recent research has shown that VGS models implicitly learn to recognise meaningful sentence constituents such as phonemes and words, and that the presence of these constituents can be decoded from the speech embeddings [5], [6], [13]- [15].…”
Section: Introduction
confidence: 99%
“…Instead, the supervisory information consists of the corresponding images of the speech descriptions. Inspired by human infants' ability to learn spoken language by listening and paying attention to the concurrent speech and visual scenes, several recent methods [15], [36], [37], [38], [39], [40] have been proposed to learn speech models grounded by visual information.…”
Section: B Visually-grounded Speech Embedding Learning
confidence: 99%
“…With speech as input, the trained speech encoder is used to extract the speech embeddings. Different from most of the previously described visually-grounded speech embedding learning methods, in which the traditional triplet loss function was adopted to train the model [15], [36], [37], [39], [40], in this work we propose a more effective matching loss, similar to the masked margin softmax loss [38], that can make full use of the negative samples within a minibatch. Furthermore, previous studies only focused on databases with scene images, primarily Flickr8k [16], while no fine-grained image databases, e.g., CUB-200 [12], in which images from different categories share similar semantics, have been considered.…”
Section: B Visually-grounded Speech Embedding Learning
confidence: 99%
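The contrast drawn in the statement above is between triplet losses, which use one negative per anchor, and margin-softmax-style matching losses, which treat every other pair in the minibatch as a negative. The following is a minimal NumPy sketch of such a batch-wise matching loss, in the spirit of the masked margin softmax; the margin value, the use of the full similarity matrix, and both scoring directions are illustrative assumptions, not the cited authors' exact formulation:

```python
import numpy as np

def matching_loss(speech_emb, image_emb, margin=0.001):
    """Sketch of a masked-margin-softmax-style matching loss.

    speech_emb, image_emb: (B, D) L2-normalised embeddings where row i
    of each matrix forms a matching speech-image pair; all off-diagonal
    pairs in the batch serve as negatives.
    """
    sim = speech_emb @ image_emb.T          # (B, B) similarity matrix
    B = sim.shape[0]
    pos = np.eye(B, dtype=bool)             # diagonal = positive pairs
    # subtract a margin from the positive similarities only
    logits = sim - margin * pos
    # speech -> image: softmax over each row (all images as candidates)
    s2i = -np.mean(np.log(np.exp(logits[pos]) / np.exp(logits).sum(axis=1)))
    # image -> speech: softmax over each column (all captions as candidates)
    i2s = -np.mean(np.log(np.exp(logits[pos]) / np.exp(logits).sum(axis=0)))
    return s2i + i2s
```

Because every off-diagonal entry of the similarity matrix contributes to the denominator, a batch of size B supplies B-1 negatives per anchor in each direction, which is the "full use of negative samples within a minibatch" the statement refers to.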