2022
DOI: 10.48550/arxiv.2203.05008
Preprint
Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Abstract: Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word rec…

Cited by 3 publications (4 citation statements) | References 14 publications
“…For training the LMs, each minibatch is sampled 50/50 from the transcripts of acoustic training data with a total of 150M unique transcripts, and text-only data which contains 50B utterances. The text-only data contains textual search queries from the domains of Maps, Qsession, News, Play, and Youtube, and a frequency-based pruning strategy, designed to improve rare word modeling, is implemented to adjust the probability of selecting each query [25]. All acoustic and LM training data is anonymized and adheres to Google AI Principles [26].…”
Section: Methodsmentioning
confidence: 99%
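The frequency-based pruning mentioned in the statement above can be sketched as follows. Note this is an illustrative assumption, not the cited paper's exact rule: it downweights over-represented queries with a word2vec-style subsampling probability (keep each occurrence with probability proportional to the square root of a threshold over the query's frequency), which flattens the heavy head ("weather") while leaving the rare tail intact. The function names and the threshold value are hypothetical.

```python
import random
from collections import Counter

def selection_probability(count, total, threshold=1e-4):
    """Keep-probability for one occurrence of a query.

    Illustrative assumption: word2vec-style subsampling, where the
    keep-probability is sqrt(threshold / frequency), capped at 1.0.
    Rare queries are almost always kept; very frequent queries are
    mostly dropped.
    """
    freq = count / total
    return min(1.0, (threshold / freq) ** 0.5)

def prune_queries(queries, threshold=1e-4, seed=0):
    """Adjust the probability of selecting each query, dropping
    occurrences of over-represented queries."""
    rng = random.Random(seed)
    counts = Counter(queries)
    total = len(queries)
    return [q for q in queries
            if rng.random() < selection_probability(counts[q], total, threshold)]
```

With a corpus dominated by duplicate "weather" queries, the head query survives only in a small fraction of its occurrences, while a one-off query keeps a much higher selection probability.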
“…The results highlight the model's competitive performance across these diverse datasets. In a different study, Huang [169] introduces strategies for curating language modelling data to enhance the recognition of rare words without compromising overall performance. These strategies demonstrate substantial impact, with the enhanced language model achieving up to a 24% relative reduction in Word Error Rate (WER) on sentences containing rare words.…”
Section: Automatic Speech Recognition (Asr)mentioning
confidence: 99%
“…8: A prompt-completion example for LaunchpadGPT [262]: The text following "prompt:" represents MFCC feature values, while "completion:" shows RGB-X tuples. The tuple (245,5,169,1) indicates that the Launchpad keyboard's second button (index 0 for the first) is purple. Figure taken from [262].…”
Section: Large Audio Models In Musicmentioning
confidence: 99%
“…For the data selection module, in order to select data that is close to the target domain, we adopt the language model based contrastive method introduced in [21]. Contrastive data selection is a well-known technique in the NLP and speech communities, and has been widely used in machine translation [22,23], and ASR [24][25][26]. It computes a domain relevance score for each utterance and filters by a threshold to keep those closest to the target domain.…”
Section: Introductionmentioning
confidence: 99%
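The contrastive data selection described in the last statement is commonly implemented as a Moore-Lewis-style cross-entropy difference: score each utterance by the gap between its log-probability under an in-domain LM and a background LM, then keep utterances whose score exceeds a threshold. The sketch below is a minimal illustration under that assumption; the toy unigram LMs are stand-ins for the neural LMs used in practice, and all names are hypothetical.

```python
import math
from collections import Counter

def make_unigram_lm(corpus, smoothing=1.0):
    """Toy add-one-smoothed unigram LM returning average per-word
    log-probability. A stand-in for a real domain LM (assumption)."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for unseen words
    def avg_logprob(sentence):
        words = sentence.split()
        return sum(math.log((counts[w] + smoothing) / (total + smoothing * vocab))
                   for w in words) / max(len(words), 1)
    return avg_logprob

def contrastive_select(utterances, in_lm, bg_lm, threshold=0.0):
    """Keep utterances whose domain relevance score
    (in-domain logprob minus background logprob) clears the threshold."""
    return [u for u in utterances if in_lm(u) - bg_lm(u) > threshold]
```

An utterance that looks like the target domain scores positively (the in-domain LM assigns it more probability than the background LM) and survives the filter; background-like utterances score negatively and are dropped.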