2022 | Preprint
DOI: 10.48550/arxiv.2202.01374
mSLAM: Massively multilingual joint pre-training for speech and text

Cited by 12 publications (24 citation statements) | References 0 publications

“…(Right) SoTA results on public speech translation tasks. Results are presented for high/middle/low resource languages as defined in [20]. Higher is better.…”
Section: Key Findings (mentioning)
confidence: 99%
“…We can improve upon their method through a combination of speech-to-text ALMs together with LM-based commonsense reasoning. First, we transcribe the audio from all videos with speech-to-text ALMs (Bapna et al. 2022) (also called ASR, or automatic speech recognition), using the publicly available Google Cloud speech-to-text API. Although raw transcripts may be challenging to incorporate into meaningful improvements for video/caption retrieval, we can leverage the reasoning capabilities of large LMs to usefully harness the transcripts.…”
Section: System Overview: Socratic Video-to-Text Retrieval (mentioning)
confidence: 99%
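As a rough illustration of the transcription step this excerpt describes, the sketch below calls the Google Cloud Speech-to-Text API through the official google-cloud-speech Python client. The bucket URI, audio encoding, and language code are placeholder assumptions, not details from the citing paper.

```python
# Hedged sketch: transcribing video audio with Google Cloud
# Speech-to-Text, as in the excerpt above. Requires the
# google-cloud-speech package and application credentials.
from google.cloud import speech

client = speech.SpeechClient()

# Audio previously extracted from a video and uploaded to Cloud Storage
# (hypothetical path).
audio = speech.RecognitionAudio(uri="gs://my-bucket/video_audio.wav")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Synchronous recognition; long_running_recognize would be used
# for audio longer than about a minute.
response = client.recognize(config=config, audio=audio)
transcript = " ".join(
    result.alternatives[0].transcript for result in response.results
)
print(transcript)
```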
“…Recently, self-supervised pre-training of large transformer encoders on massive amounts of unlabeled audio data, followed by task-specific fine-tuning, has emerged as the de facto approach for achieving state-of-the-art performance on several spoken language processing tasks. However, popular self-supervised representation learning (SSL) approaches such as Wav2vec-2.0 [1] and others [2]-[12] learn speech embeddings at the acoustic frame level, i.e., for short speech segments of 10 to 20 milliseconds.…”
Section: Introduction (mentioning)
confidence: 99%
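To make the notion of frame-level SSL embeddings concrete, here is a minimal sketch that extracts wav2vec 2.0 representations with the Hugging Face transformers library; the facebook/wav2vec2-base checkpoint and 16 kHz dummy input are assumptions for illustration, not code from the cited work.

```python
# Hedged sketch: frame-level embeddings from a public wav2vec 2.0
# checkpoint. One embedding vector is produced per ~20 ms of audio,
# the granularity the excerpt above contrasts with segment-level
# representations.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)  # 1 second of 16 kHz audio (dummy input)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, frames, 768)

# The convolutional front end strides 320 samples per frame at 16 kHz,
# so 1 second of audio yields roughly 49 frame-level embeddings.
print(hidden.shape)  # approximately (1, 49, 768)
```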