ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414478

How Phonotactics Affect Multilingual and Zero-Shot ASR Performance

Abstract: The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic mod…

Cited by 10 publications (14 citation statements)
References 14 publications (32 reference statements)
“…Interestingly, we see that the hybrid system with a weak phonotactic model performs best (68.6%), and the one with a strong phonotactic model performs worst (66.7%). This finding is consistent with our previous hypothesis that learning non-target languages' phonotactics is harmful for a zero-shot ASR system [22] [RQ3]. If the phonotactic model had perfect knowledge of the phonotactics of the target language, the F1-score raises to 81.8%, which is mostly due to a perfect precision score of 100%.…”
Section: Phonetic Inventory Discovery
Citation type: supporting
confidence: 91%
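
The quoted statement attributes the 81.8% F1-score mostly to a precision of 100%. As a rough check of what that implies, the sketch below inverts the F1 formula to recover the recall. It assumes the standard definition F1 = 2PR / (P + R), which the quoted passage does not spell out; the function names `f1_score` and `recall_from_f1` are illustrative and do not come from the cited work.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this F1/precision pair")
    return f1 * precision / denominator


# Figures taken from the quoted statement: F1 = 81.8%, precision = 100%.
implied_recall = recall_from_f1(0.818, 1.0)
print(f"implied recall: {implied_recall:.3f}")
```

Under that assumption the implied recall is about 69%, so the remaining errors in the perfect-phonotactics condition would be missed phones, not spurious ones.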
“…So, despite lack of explicit constraint on the phonotactic information, we see an increase in F1 scores for the E2E ASR multi models. This is consistent with the hypothesis of [22] that the encoder-decoder transformer systems are learning a representation of the target languages' phonotactics [RQ3]. Finally, the best result of 90.6% F1-score is obtained by a fully monolingual hybrid system with a word-level trigram LM, likely due to its stronger phonotactic constraint imposed by the presence of a decoding graph.…”
Section: Phonetic Inventory Discovery
Citation type: supporting
confidence: 85%