ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8682903

Acoustically Grounded Word Embeddings for Improved Acoustics-to-word Speech Recognition

Abstract: Direct acoustics-to-word (A2W) systems for end-to-end automatic speech recognition are simpler to train, and more efficient to decode with, than sub-word systems. However, A2W systems can have difficulties at training time when data is limited, and at decoding time when recognizing words outside the training vocabulary. To address these shortcomings, we investigate the use of recently proposed acoustic and acoustically grounded word embedding techniques in A2W systems. The idea is based on treating the final p…

Cited by 32 publications (37 citation statements)
References 26 publications

“…2, we use PWCCA to measure similarity between the W2V2 layer representations and various continuous-valued quantities of interest, either (i) from a different layer of the same model (CCA-intra), (ii) from a fine-tuned version of the model (CCA-inter), or (iii) from an external representation. For the third type of analysis we use mel filter bank features (CCA-mel), acoustically grounded word embeddings [31] (cca-agwe) and GloVe word embeddings [32] (cca-glove) as ways to assess the local acoustic, word-level acoustic-phonetic, and word meaning information encoded in the W2V2 representations respectively.…”
Section: Analysis Methods (mentioning)
confidence: 99%
“…Following previous work [24,25], we use a classification objective as our neural baseline (Fig. 1-a).…”
Section: Phone N-gram Detection Objective (mentioning)
confidence: 99%
“…Segments can then be efficiently compared by calculating the distance in the embedding space. Given the advantages AWEs have over alignment methods, several AWE models have been proposed [12][13][14][15][16][17][18][19][20][21][22][23]. Many of these are for the supervised setting, using labelled data to train a discriminative model.…”
Section: Introduction (mentioning)
confidence: 99%
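
The last citation statement above notes that speech segments can be compared by calculating a distance in the embedding space. As a minimal sketch of that idea, the following assumes each segment has already been mapped to a fixed-dimensional acoustic word embedding (AWE) and uses cosine distance, which is one common choice and not necessarily the metric used in the cited work. The function names, segment labels, and vector values are hypothetical and chosen only for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def nearest_segment(query, candidates):
    """Return (label, distance) for the candidate embedding closest to the query."""
    return min(
        ((label, cosine_distance(query, emb)) for label, emb in candidates.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical 4-dimensional embeddings; real AWEs are produced by a trained
# embedding model and typically have many more dimensions.
embeddings = {
    "segment_a": [0.9, 0.1, 0.0, 0.2],
    "segment_b": [0.0, 1.0, 0.3, 0.0],
}
query = [1.0, 0.0, 0.0, 0.1]

label, dist = nearest_segment(query, embeddings)
print(label, round(dist, 4))
```

Because every segment is reduced to a single vector, each comparison costs time linear in the embedding dimensionality, independent of segment duration, which is the efficiency advantage over alignment-based comparison that the statement refers to.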