Linguistic Unit Discovery from Multi-Modal Inputs in Unwritten Languages: Summary of the “Speaking Rosetta” JSALT 2017 Workshop

Scharenborg, Odette; Besacier, Laurent; Black, Alan W.; Hasegawa‐Johnson, Mark; Metze, Florian; Neubig, Graham; Stüker, Sebastian; Godard, Pierre; Müller, Markus; Ondel, Lucas; Palaskar, Shruti; Arthur, Philip; Ciannella, Francesco; Du, Mingxing; Larsen, Elin; Merkx, Danny; Riad, Rachid; Wei, Liming; Dupoux, Emmanuel

doi:10.1109/icassp.2018.8461761

Cited by 34 publications

(29 citation statements)

References 3 publications


“…In (Kamper et al) a dynamic time warping alignment is used to discover similar segment pairs. Our work is inspired by the research efforts in reducing the dependence on labeled data for building ASR systems through unsupervised unit discovery and acoustic representation learning (Park and Glass, 2008;Glass;et. al., a,f), and through multi- and cross-lingual transfer learning in low-resource conditions (et.…”

Section: Discussion and Related Work (mentioning)

confidence: 99%

“…background noise, recording channel, speaker identity, accent, emotional state, topic under discussion, and the language used in communication. The practical need for building ASR systems for new conditions with limited resources spurred a lot of work focused on unsupervised speech recognition and representation learning (Park and Glass, 2008;Glass;et. al., a,f;van den Oord et al, 2018), in addition to semi- and weakly-supervised learning techniques aiming at reducing the supervised data needed in real-world scenarios (Vesely et al;Li et al, b;Krishnan Parthasarathi and Strom;Chrupała et al;Kamper et al, 2017).…”

Section: Introduction (mentioning)

confidence: 99%


Effectiveness of Self-Supervised Pre-Training for ASR

Baevski

Mohamed

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

131


We present pre-training approaches for self-supervised representation learning of speech data. A BERT masked language model loss on discrete features is compared with an InfoNCE-based contrastive loss on continuous speech features. The pre-trained models are then fine-tuned with a Connectionist Temporal Classification (CTC) loss to predict target character sequences. To study the impact of stacking multiple feature learning modules trained using different self-supervised loss functions, we test the discrete and continuous BERT pre-training approaches on spectral features and on learned acoustic representations, showing synergistic behaviour between acoustically motivated and masked language model loss functions. In low-resource conditions using only 10 hours of labeled data, we achieve Word Error Rates (WER) of 10.2% and 23.5% on the standard test "clean" and "other" benchmarks of the Librispeech dataset, which is almost on par with previously published work that uses 10 times more labeled data. Moreover, compared to previous work that uses two models in tandem (Baevski et al., 2019b), by using one model for both BERT pre-training and fine-tuning, our model provides an average relative WER reduction of 9%.


“…Several recent studies have trained models on images paired with unlabelled speech [4][5][6][22][23][24][25][26]. Most approaches map images and speech into a common space, allowing images to be retrieved using speech and vice versa.…”

Section: Related Work (mentioning)

confidence: 99%

Predicting the Features of World Atlas of Language Structures from Speech

Gutkin

Merkulova

Jansche

2018

6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018)


Recent work considered how images paired with speech can be used as supervision for building speech systems when transcriptions are not available. We ask whether visual grounding can be used for cross-lingual keyword spotting: given a text keyword in one language, the task is to retrieve spoken utterances containing that keyword in another language. This could enable searching through speech in a low-resource language using text queries in a high-resource language. As a proof-of-concept, we use English speech with German queries: we use a German visual tagger to add keyword labels to each training image, and then train a neural network to map English speech to German keywords. Without seeing parallel speech-transcriptions or translations, the model achieves a precision at ten of 58%. We show that most erroneous retrievals contain equivalent or semantically relevant keywords; excluding these would improve P@10 to 91%.


“…The speech and images are then projected into the same "semantic" space. The DNN then learns to associate [Footnote 1: Note, a summary and initial results of this work were presented in [59], also available in the HAL repository: https://hal.archives-ouvertes.fr/hal-01709578/document. The current paper provides more details on the experimental setups of the experiments, including more details on the used Deep Neural Network architectures and algorithms and rationales for the experiments.]…”

Section: Introduction (mentioning)

confidence: 99%

Speech Technology for Unwritten Languages

Scharenborg

Ondel

Palaskar

et al. 2020

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite


Speech technology plays an important role in our everyday life. Among others, speech is used for human-computer interaction, for instance for information retrieval and on-line shopping. In the case of an unwritten language, however, speech technology is unfortunately difficult to create, because it cannot be created by the standard combination of pre-trained speech-to-text and text-to-speech subsystems. The research presented in this paper takes the first steps towards speech technology for unwritten languages. Specifically, the aim of this work was 1) to learn speech-to-meaning representations without using text as an intermediate representation, and 2) to test the sufficiency of the learned representations to regenerate speech or translated text, or to retrieve images that depict the meaning of an utterance in an unwritten language. The results suggest that building systems that go directly from speech-to-meaning and from meaning-to-speech, bypassing the need for text, is possible.
